29 changes: 29 additions & 0 deletions AGENTS.md
@@ -0,0 +1,29 @@
# Codex Agent Guide

## Repo Summary
- Offline emergency knowledge assistant built in C++ with llama.cpp.
- Docker Compose runs `tika`, `jic` (server), and `ingestion`.
- Documents live under `public/sources` and are ingested into a vector index.

## Key Paths
- `docker-compose.yml`: primary model configuration and service wiring.
- `src/`: C++ server and ingestion pipeline.
- `public/sources/`: PDFs and other content to ingest.
- `helper-scripts/`: scripts for model fetching and basic testing.

## Local Setup
1. Install Docker + Docker Compose.
2. Download GGUF files into `./gguf_models/` (see `helper-scripts/fetch-models.sh`).
3. Run: `docker compose up --build`.

## Tests (Local)
- Config sanity: `./helper-scripts/test-config.sh`
- Server smoke test (server must be running): `./helper-scripts/test-server.sh`

## Remote / CI Guidance
- Minimal CI should run `./helper-scripts/test-config.sh`.
- Optional CI step (if services can be started): `docker compose up -d` then `./helper-scripts/test-server.sh`, then `docker compose down`.
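For example, a minimal GitHub Actions job might look like the following (assumed syntax; no workflow file exists in the repo yet, so this is a sketch, not the project's actual CI):

```yaml
# Sketch of a minimal CI job; the Docker-based smoke test is optional.
name: ci
on: [push, pull_request]
jobs:
  config-sanity:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./helper-scripts/test-config.sh
      # Optional smoke test, only if the runner can start services:
      # - run: docker compose up -d
      # - run: ./helper-scripts/test-server.sh
      # - run: docker compose down
```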

## Data Notes
- Keep licensing and provenance clear for all added sources.
- Prefer public-domain or explicitly redistributable materials.
20 changes: 10 additions & 10 deletions README.md
@@ -5,9 +5,7 @@

### [Learn More - CompanionIntelligence.com/JIC](https://companionintelligence.com/JIC)

If the Internet goes dark, you should still be able to quickly find the knowledge that can help you survive and thrive during a crisis. The goal of the JIC project is to deliver a self-contained LLM-powered / AI driven conversational search engine over a curated set of emergency survival pdfs that can be used without the Internet.

This includes survival guides, medical references, even agricultural know-how and engineering resources, as well as broader educational materials like offline Wikipedia, open textbooks, art and humanities. If the Internet goes dark, you should still be able to quickly find the knowledge that can help you survive and thrive during a crisis.
If the Internet goes dark, you should still be able to quickly find the knowledge that can help you survive and thrive. JIC aims to deliver a self-contained LLM-powered conversational search engine over curated emergency PDFs that can be used fully offline. The corpus includes survival guides, medical references, agricultural know-how, engineering resources, and broader educational materials like offline Wikipedia and open textbooks.


Please feel free to join us. This is a work in progress and we welcome your participation.
@@ -27,7 +25,7 @@ The problem is that these services are often cloud based, and not always at our

The fact is that there's a real difference between trying to read through a book on how to treat a burn during an emergency and getting quick, conversational help right away, or at least quick guidance on where to look for more details. What would you do if the internet went down? Or even just an extended power outage? What is your family's plan for region-specific threats such as tornadoes, tsunamis, or forest fires? Many of us have some kind of plan: a flashlight in a drawer, extra food supplies, water, cash, a map of community resources, a muster point.

The world has changed, we now heavily rely on tools such as ChatGPT, Claude, Google and other online resources. Even for small questions such as "how do you wire up rechargeable batteries to a solar panel?" or "what is the probable cause of a lower right side stomach pain?". The thing most of us rely heavily on information itself, and that information is not always at our fingertips.
The world has changed, and we now rely on these tools for rapid answers. The problem is that information itself is not always at our fingertips when connectivity fails.

Validating a tool like this raises many questions. Who are typical users of the dataset? What are typical scenarios? Can we build a list of typical questions a user may ask of the dataset? Can we have regression tests against the ability of the dataset to resolve those queries? Are there differences in what is needed for short, medium, or extended emergencies and survival situations? In this ongoing project we'll try to tackle these questions and improve over time.

@@ -77,14 +75,14 @@ We've tried a variety of approaches, ollama, python, n8n - our current stack is

| Component | Role |
|-------------------|-------------------------------------------------|
| 🧠 `Llama.cpp` | LLM loader (e.g. `llama3`) |
| 🧠 `Llama.cpp` | LLM loader (GGUF models) |
| 📄 `Apache Tika` | PDF-to-text extractor |
| 🔍 `FAISS` | Vector search over parsed PDF chunks |
| 🔍 `SimpleVectorIndex` | In-memory vector search over parsed chunks |
| 🌐 `C++ Server` | Simple API + minimal HTML frontend |

Note: we may shift a few pieces around here; for example, we may move to pgvector.

We're thinking of these engines for chewing through the context (the pdfs) - basically presenting each pdf page (generated with Tika) to qwen2.5-vl. Using a smaller model to be (high end) laptop friendly:
We're thinking of these engines for chewing through the context (the PDFs). The current configuration uses `qwen2.5-vl:7b` with a matching `mmproj` file. Using a smaller model keeps things high-end laptop friendly:

https://huggingface.co/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
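For reference, after fetching, the local model directory is expected to look roughly like this (filenames taken from the docker-compose defaults; exact quantization suffixes may vary):

```
gguf_models/
├── Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf        # chat model
├── mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf    # vision projector
└── nomic-embed-text-v1.5.Q4_K_M.gguf         # embedding model
```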

@@ -115,7 +113,7 @@ find Survival-Data/HOME -type f -iname "*.pdf" -exec cp {} sources/ \;

### 2. Prepare models

**Important:** Our setup of Docker will avoid downloading models from the internet. You must prepare them locally first.
**Important:** The Docker setup avoids downloading models from the internet. You must prepare them locally first.

1. **Download GGUF files manually** (one time, on any connection):
```bash
@@ -177,8 +175,11 @@ You can customize which models to use by setting environment variables:

```bash
# Use different models
export LLM_MODEL=llama3.2
export LLM_MODEL=qwen2.5-vl:7b
export EMBEDDING_MODEL=nomic-embed-text
export LLM_GGUF_FILE=Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf
export LLM_MMPROJ_FILE=mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf
export EMBEDDING_GGUF_FILE=nomic-embed-text-v1.5.Q4_K_M.gguf

# Download your chosen models
./helper-scripts/fetch-models.sh
@@ -187,4 +188,3 @@ export EMBEDDING_MODEL=nomic-embed-text
docker compose build
docker compose up
```

2 changes: 1 addition & 1 deletion docs/1100-questions.md
@@ -29,4 +29,4 @@
- **I see people looting the store across the street. Should I try to stop them?**
- **My neighbor is unconscious after the explosion. How can I check if they’re alive?**
- **I haven’t seen a rescue team in days. What else can I do to get help?**
- **There’s a dead near our shelter. How do we handle this safely?**
- **There’s a deceased person near our shelter. How do we handle this safely?**
4 changes: 2 additions & 2 deletions docs/1300-categorization.md
@@ -8,7 +8,7 @@ Here are a few of the requirements we have around how people would use a dataset

- Short term versus longer term ... Information needs vary from short to medium to long-term.
- Crisis mode... Information can be hard to find quickly in a crisis (an LLM may provide critical indexability / searchability).
- Education... Readers themselves have varying education leveks, comprehension, backgrounds, languages.
- Education... Readers themselves have varying education levels, comprehension, backgrounds, languages.
- Visual Learners... Many readers consume information visually, through mixed-media, not always text.
- Biases... It's worth noting that there may be biases in how we organize by default as well - this article covers some of those observations: https://www.careharder.com/blog/systemic-injustice-in-the-dewey-decimal-system
- Updating.... Some data, like advanced medical guidelines, might change significantly over time; others (public domain literature) don’t.
@@ -24,7 +24,7 @@ It's worth noting that there are different flavors of knowledge as well as diffe
- Procedural ... know-how or steps required to complete a task; acquired via practice (riding a bike).
- Declarative ... facts, concepts (a person's name).
- Empirical ... based on observation of the world.
- Meta-cognition; awareness of ones own thinking process and learning strategies.
- Meta-cognition; awareness of one's own thinking process and learning strategies.

The fact that we need this dataset offline also adds challenges:

2 changes: 2 additions & 0 deletions docs/1400-sources.md
@@ -1,5 +1,7 @@
## Data Sources - Ideal

Note: Verify licenses and redistribution rights before mirroring or bundling content.

### **1. Official Emergency Services & Alerts**
- **Local & National Emergency Hotlines** (911, 112, etc.)
- **Weather Alerts & Disaster Warnings** (NOAA, National Weather Service, Tsunami Warning Centers)
2 changes: 1 addition & 1 deletion docs/1500-hardware.md
@@ -14,4 +14,4 @@ Running the JIC tool at home requires performant hardware. Hardware itself has i

6. “Server on a Stick”. Another approach is to store the data on an external HDD or SSD to use with any laptop or PC. This is the cheapest way to have JIC on hand, though running JIC from an external drive will be slower.

8. We recommend using a CI server for optimal results. The full JIC experience is optional at check-out.
7. Local backups: Keep multiple copies of the dataset and models in separate locations.
16 changes: 4 additions & 12 deletions docs/1600-architecture.md
@@ -1,7 +1,5 @@
# Just In Case (JIC) - System Architecture

(Largely ChatGPT generated)

## Reviewing JIC

Just In Case is an emergency knowledge assistant that provides conversational access to a collection of PDF documents through a modern AI-powered interface. The system is designed to be self-contained, efficient, and deployable in resource-constrained environments where internet connectivity may be limited or unavailable.
@@ -28,19 +26,15 @@ At the heart of the system is Apache Tika, running as a dedicated service for PD

### Embedding Generation

The system uses Nomic Embed Text v1.5, a state-of-the-art embedding model optimized for semantic search. Running directly through llama.cpp with GGUF quantized models, it generates 768-dimensional vectors for each text chunk. These embeddings capture the semantic meaning of the text, enabling the system to find relevant passages based on meaning rather than just keyword matching.
The system uses Nomic Embed Text v1.5, an embedding model optimized for semantic search. Running directly through llama.cpp with GGUF quantized models, it generates 768-dimensional vectors for each text chunk. These embeddings capture the semantic meaning of the text, enabling the system to find relevant passages based on meaning rather than just keyword matching.

### Vector Storage and Retrieval

[ This may migrate to pgvector ]

Rather than using external dependencies like FAISS, we implemented a custom in-memory vector index with brute-force nearest neighbor search. While this approach may seem simplistic, it provides several advantages: zero external dependencies, complete control over the implementation, and surprisingly good performance for moderate-sized document collections. The index is persisted to disk in a simple binary format, allowing for quick startup times and data persistence across container restarts.

### Language Model Integration

[ This may change to Qwen or other models ]

The conversational interface is powered by Llama 3.2 1B Instruct, chosen for its balance of capability and resource efficiency. The model runs directly through llama.cpp, leveraging GGUF quantization to reduce memory requirements while maintaining quality. The integration includes careful prompt engineering to ensure the model synthesizes information from retrieved documents rather than simply regurgitating text.
The conversational interface is powered by Qwen2.5-VL 7B Instruct, chosen for its balance of capability and resource efficiency. The model runs directly through llama.cpp, leveraging GGUF quantization to reduce memory requirements while maintaining quality. The integration includes careful prompt engineering to ensure the model synthesizes information from retrieved documents rather than simply regurgitating text.

### Web Interface

@@ -54,13 +48,11 @@ The decision to use C++ throughout the stack eliminates the overhead of language

### Custom Vector Store vs. FAISS

[ This may change ]

While FAISS offers sophisticated indexing algorithms, our custom implementation provides adequate performance for typical document collections while eliminating a complex dependency. The brute-force search is parallelized and optimized for cache locality, making it surprisingly efficient for collections up to tens of thousands of documents.

### PostgreSQL and pgvector Provisioning

Although the current implementation uses our custom vector store, we've provisioned PostgreSQL with pgvector extension for future scalability. This forward-thinking approach allows for a smooth transition to a more sophisticated vector storage solution when document collections grow beyond the efficient range of our current implementation.
Although the current implementation uses our custom vector store, the project can be extended to PostgreSQL with pgvector for future scalability. This approach allows for a smooth transition to a more sophisticated vector storage solution when document collections grow beyond the efficient range of our current implementation.

### Separation of Ingestion and Serving

21 changes: 12 additions & 9 deletions docs/MODEL_CONFIGURATION.md
@@ -1,20 +1,21 @@
# Model Configuration Guide

This project uses environment variables for centralized model configuration.
**Change the model configuration in one place: `docker-compose.yml`**

## Environment Variables

### Core Model Configuration
- `LLM_MODEL`: The name of the LLM model for Ollama (default: `qwen2-vl:8b`)
- `EMBEDDING_MODEL`: The name of the embedding model for Ollama (default: `nomic-embed-text`)
- `LLM_MODEL`: The name of the LLM model (default: `qwen2.5-vl:7b`)
- `EMBEDDING_MODEL`: The name of the embedding model (default: `nomic-embed-text`)

### GGUF File Configuration
- `LLM_GGUF_FILE`: The filename of the LLM GGUF file (default: `qwen2-vl-8b-instruct-q4_k_m.gguf`)
- `LLM_GGUF_FILE`: The filename of the LLM GGUF file (default: `Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf`)
- `LLM_MMPROJ_FILE`: The filename of the LLM vision projection file (default: `mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf`)
- `EMBEDDING_GGUF_FILE`: The filename of the embedding GGUF file (default: `nomic-embed-text-v1.5.Q4_K_M.gguf`)

### Repository Configuration (for downloading)
- `QWEN_GGUF_REPO`: HuggingFace repo for Qwen models (default: `Qwen/Qwen2-VL-8B-Instruct-GGUF`)
- `QWEN_GGUF_REPO`: HuggingFace repo for Qwen models (default: `Qwen/Qwen2.5-VL-7B-Instruct-GGUF`)

P2: Correct default Qwen repo to match fetch script

The documented default for QWEN_GGUF_REPO now says Qwen/Qwen2.5-VL-7B-Instruct-GGUF, but the actual download logic defaults to ggml-org/Qwen2.5-VL-7B-Instruct-GGUF in helper-scripts/fetch-models.sh (line 13). This mismatch can cause users who rely on this doc to set an incorrect repo override and then fail or fetch unexpected artifacts when running the model download flow.


- `NOMIC_GGUF_REPO`: HuggingFace repo for Nomic models (default: `nomic-ai/nomic-embed-text-v1.5-GGUF`)

## How to Change Models
@@ -26,6 +27,7 @@ This project uses environment variables for centralized model configuration.
environment:
- LLM_MODEL=your-new-model:tag
- LLM_GGUF_FILE=your-new-model-file.gguf
- LLM_MMPROJ_FILE=your-new-model-mmproj.gguf
# Keep embedding model the same or change if needed
```

@@ -38,12 +40,13 @@ This project uses environment variables for centralized model configuration.
docker compose up -d
```

### Example: Switching to Llama 3.2 3B
### Example: Switching to a Different Qwen Variant

```yaml
environment:
- LLM_MODEL=llama3.2:3b
- LLM_GGUF_FILE=Llama-3.2-3B-Instruct-Q4_K_M.gguf
- LLM_MODEL=qwen2.5-vl:7b
- LLM_GGUF_FILE=Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf
- LLM_MMPROJ_FILE=mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf
```

## Architecture
@@ -59,4 +62,4 @@
- `src/llm.h` - LLM model loading
- `src/embeddings.h` - Embedding model loading
- `helper-scripts/fetch-models.sh` - Download and setup
- `docker-compose.yml` - Environment variable definitions
2 changes: 2 additions & 0 deletions public/sources/300_Food/README.md
@@ -0,0 +1,2 @@
# 300 Food
### This is a starting MVP dataset. LOOKING FOR CONTRIBUTION Please suggest files as issues!
16 changes: 9 additions & 7 deletions public/sources/README.md
@@ -1,15 +1,17 @@
# DOWNLOADS

WIP: [AUTO DOWNLOAD SCRIPTS](https://github.com/companionintelligence/JustInCase/blob/main/helper-scripts/fetch-source-data.sh)


TODO: MIRROR DOWNLOADS IN TORRENTS

- [RECCOMENDED - LibreText Textbooks](https://drive.google.com/drive/folders/1cQRIxQwKx4a80hRzMF5DVE1lmovjHwCF?usp=sharing)
- [RECCOMENDED - OpenStax Textbooks](https://drive.google.com/drive/folders/1uxihkCdCbSavluFWFQYVtXM_wp8k2UtY?usp=sharing)
- [RECCOMENDED - Survival Data Corpus](https://github.com/PR0M3TH3AN/Survival-Data)
- [RECCOMENDED - Standard ebooks](https://standardebooks.org/bulk-downloads)
- [RECCOMENDED - Gutenberg file URLs for ebook id](https://www.gutenberg.org/ebooks/)
- [Archive.org](Archive.org)
- [RECOMMENDED - LibreText Textbooks](https://drive.google.com/drive/folders/1cQRIxQwKx4a80hRzMF5DVE1lmovjHwCF?usp=sharing)
- [RECOMMENDED - OpenStax Textbooks](https://drive.google.com/drive/folders/1uxihkCdCbSavluFWFQYVtXM_wp8k2UtY?usp=sharing)
- [RECOMMENDED - Survival Data Corpus](https://github.com/PR0M3TH3AN/Survival-Data)
- [RECOMMENDED - Standard Ebooks](https://standardebooks.org/bulk-downloads)
- [RECOMMENDED - Gutenberg file URLs for ebook id](https://www.gutenberg.org/ebooks/)
- [Archive.org](https://archive.org)
- [LibriVox - find LibriVox/IA zips](https://librivox.org/)
- [standardebooks bulk](https://standardebooks.org/bulk-downloads)

Note: Verify licenses before mirroring or redistributing any content.