@@ -9,7 +9,8 @@ A powerful semantic search system for log files that enables natural language qu
99- **Local LLM Integration**: Generates AI responses using Ollama with customizable models
1010- **Interactive Query Interface**: Rich terminal interface with markdown rendering
1111- **GPU Acceleration**: Optional GPU support for faster embedding generation
12- - **Comprehensive File Support**: Indexes `.py`, `.log`, `.js`, `.ts`, `.md`, `.sql`, `.html`, `.csv` files
12+ - **Automatic File Detection**: Intelligently detects and indexes all text-based files by content analysis
13+ - **Security-First Design**: Client-side trust_remote_code management with consent prompts and persistent tracking
1314- **Environment Configuration**: Fully configurable via `.env` files
1415
1516## Quick Start
@@ -63,25 +64,30 @@ USE_LOCAL_EMBEDDINGS=true
6364USE_LOCAL_OLLAMA=true
6465```
6566
66- ### 3. Index Your Log Files
67+ ### 3. Index Your Files
6768
68- Index a directory using local embeddings (default):
69+ Index a directory with automatic file detection:
6970
7071``` bash
71- python index.py /path/to/your/logs
72+ python index.py /path/to/your/files
7273```
7374
75+ The system will:
76+ - Automatically detect all text-based files by content analysis
77+ - Skip binary files and common build/cache directories
78+ - Prompt for trust_remote_code consent if needed for the embedding model
79+
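A minimal sketch of what detection "by content analysis" could look like, assuming a simple heuristic: a file counts as text if a leading sample contains no NUL bytes and decodes as UTF-8. The names `SKIP_DIRS`, `looks_like_text`, and `iter_text_files` are hypothetical and are not taken from `index.py`, whose actual rules may differ.

```python
import codecs
from pathlib import Path

# Hypothetical skip list; the project's real list of build/cache directories is not shown.
SKIP_DIRS = {".git", "node_modules", "__pycache__", "build", "dist", ".venv"}


def looks_like_text(sample: bytes) -> bool:
    """Heuristic: no NUL bytes, and the sample decodes as UTF-8."""
    if b"\x00" in sample:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the sample boundary.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def iter_text_files(root: Path, sample_size: int = 4096):
    """Yield files under root that look like text, skipping SKIP_DIRS."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if any(part in SKIP_DIRS for part in path.relative_to(root).parts):
            continue
        with path.open("rb") as fh:
            sample = fh.read(sample_size)
        if looks_like_text(sample):
            yield path
```

Sampling only the first few kilobytes keeps the scan cheap on large log files, at the cost of misclassifying files whose binary content starts later.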
7480Or specify embedding type:
7581
7682``` bash
77- # Use local SentenceTransformer embeddings
78- python index.py /path/to/logs --local-embeddings
83+ # Use local SentenceTransformer embeddings (default)
84+ python index.py /path/to/files --local-embeddings
7985
8086# Use Ollama embeddings
81- python index.py /path/to/logs --ollama-embeddings
87+ python index.py /path/to/files --ollama-embeddings
8288
8389# Use remote embedding server
84- python index.py /path/to/logs --remote-embeddings
90+ python index.py /path/to/files --remote-embeddings
8591```
8692
8793Additional options:
@@ -94,14 +100,19 @@ python index.py /path/to/logs --model custom-model --chunk-size 1500
94100python index.py /path/to/logs --chroma-path ./my_custom_db
95101```
96102
97- ### 4. Query Your Logs
103+ ### 4. Query Your Indexed Content
98104
99105Start the interactive query interface:
100106
101107``` bash
102108python ask.py
103109```
104110
111+ The system will:
112+ - Auto-detect the embedding type used during indexing
113+ - Apply the same trust_remote_code settings for consistency
114+ - Generate responses using Ollama's local LLM
115+
105116Or specify a custom output file:
106117
107118``` bash
@@ -113,22 +124,32 @@ python ask.py my_queries.md
113124### Core Components
114125
1151261. **Unified Indexer (`index.py`)**
116- - Processes repositories and creates vector embeddings
127+ - Processes repositories with automatic file detection
117128 - Supports multiple embedding strategies via handler classes
118- - Chunks code into configurable segments (default: 2000 characters)
129+ - Chunks content into configurable segments (default: 2000 characters)
130+ - Client-side trust_remote_code management
119131 - Stores embeddings in ChromaDB with metadata tracking
120132
1211332. **Query Interface (`ask.py`)**
122- - Interactive CLI for natural language log queries
123- - Auto-detects embedding type from metadata
134+ - Interactive CLI for natural language queries
135+ - Auto-detects embedding type and trust settings from metadata
124136 - Generates responses using Ollama's local LLM
137+ - Consistent security model with indexing phase
125138 - Saves all Q&A pairs with timestamps
126139
1271403. **Embedding Server (`embedding_server.py`)**
128141 - Optional remote embedding service with GPU support
142+ - Respects client-side trust_remote_code decisions
129143 - RESTful API with health checks and server info
130- - Configurable via command-line arguments
131- - Supports batch processing and model caching
144+ - Dynamic model loading with trust setting caching
145+ - Supports batch processing and multiple model variants
146+
147+ 4. **Trust Manager (`trust_manager.py`)**
148+ - Centralized security management for trust_remote_code
149+ - Auto-detection of models requiring remote code execution
150+ - Interactive consent prompts with risk/benefit explanations
151+ - Persistent approval tracking in .env files
152+ - CLI tools for managing trust settings
132153
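The chunking step listed under the Unified Indexer can be illustrated with a short sketch. It assumes plain fixed-size character windows with no overlap. The README only states the default size (2000 characters) and that it is configurable via `--chunk-size`, so the function name `chunk_text` and the splitting strategy are assumptions rather than the project's actual code.

```python
def chunk_text(text: str, chunk_size: int = 2000) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in fixed-size windows; the last chunk may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

A real indexer may prefer to split on line boundaries so that individual log entries are not cut in half. This sketch does not attempt that.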
133154### Embedding Handlers
134155
@@ -149,6 +170,7 @@ python ask.py my_queries.md
149170| `CHROMA_PATH` | ChromaDB storage path | `./chroma_code` |
150171| `USE_LOCAL_EMBEDDINGS` | Default embedding strategy | `true` |
151172| `USE_LOCAL_OLLAMA` | Use local Ollama instance | `true` |
173+ | `TRUST_REMOTE_CODE_*` | Model-specific trust settings | Auto-managed |
152174
153175### Command Line Options
154176
@@ -180,6 +202,73 @@ Options:
180202 --debug Enable debug mode
181203```
182204
205+ ## Security: Trust Remote Code Management
206+
207+ The system includes a comprehensive security framework for models that require `trust_remote_code=True`. This client-side security system:
208+
209+ - **Auto-detects** which models likely need remote code execution based on known patterns
210+ - **Prompts for informed consent** with detailed security warnings
211+ - **Persists decisions** in `.env` with model-specific hash tracking
212+ - **Client-side control** - trust decisions made locally, not on remote servers
213+ - **Cross-component consistency** - same security model for indexing, querying, and serving
214+
215+ ### How It Works
216+
217+ 1. **Detection**: System analyzes model names against known patterns
218+ 2. **User Consent**: Interactive prompts with clear risk/benefit explanations
219+ 3. **Persistence**: Decisions saved locally with model identification hashes
220+ 4. **Communication**: Client sends trust settings to remote embedding servers
221+
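The detection step above could be approximated by matching the model name against a list of glob patterns. This is only a sketch: the pattern list and the name `likely_needs_remote_code` are hypothetical, and the single pattern is chosen because the README's own example model comes from the `nomic-ai` namespace. The patterns actually used by `trust_manager.py` are not shown here.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for illustration; not the project's actual list.
LIKELY_REMOTE_CODE_PATTERNS = ("nomic-ai/*",)


def likely_needs_remote_code(model_name: str) -> bool:
    """Return True if the model name matches a known remote-code pattern."""
    name = model_name.strip().lower()
    return any(fnmatchcase(name, pattern) for pattern in LIKELY_REMOTE_CODE_PATTERNS)
```

A name-based check like this is only a hint: it can miss models that need remote code and flag models that do not, which is why the flow still ends in an explicit consent prompt.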
222+ ### Managing Trust Settings
223+
224+ ``` bash
225+ # List all approved/denied models
226+ python trust_manager.py --list
227+
228+ # Check if a specific model needs trust_remote_code
229+ python trust_manager.py --check "nomic-ai/nomic-embed-text-v1.5"
230+ ```
231+
232+ ### Security Flow
233+
234+ When you first use a model requiring remote code execution:
235+
236+ ```
237+ ==============================================================
238+ SECURITY WARNING: Remote Code Execution
239+ ==============================================================
240+ Model: nomic-ai/nomic-embed-text-v1.5
241+
242+ This model may require 'trust_remote_code=True' which allows
243+ the model to execute arbitrary code during loading.
244+
245+ RISKS:
246+ - The model could execute malicious code
247+ - Your system could be compromised
248+ - Data could be stolen or corrupted
249+
250+ BENEFITS:
251+ - Access to newer/specialized models
252+ - Better embedding quality for some models
253+
254+ Your choice will be saved for this model.
255+ ==============================================================
256+ Allow remote code execution for this model? [y/N]:
257+ ```
258+
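The `[y/N]` suffix in the prompt conventionally means that "No" is the default. A small sketch of that convention follows, with hypothetical helper names (`parse_consent`, `ask_consent`) that are not taken from `trust_manager.py`:

```python
def parse_consent(answer: str) -> bool:
    """Treat only an explicit 'y' or 'yes' as approval; anything else is a denial."""
    return answer.strip().lower() in {"y", "yes"}


def ask_consent(ask=input) -> bool:
    """Prompt the user once; `ask` is injectable so the prompt can be tested."""
    return parse_consent(ask("Allow remote code execution for this model? [y/N]: "))
```

Defaulting to denial means that pressing Enter, or piping empty input in a non-interactive run, never enables remote code by accident.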
259+ ### Trust Settings Storage
260+
261+ Approval decisions are stored in your `.env` file:
262+
263+ ``` bash
264+ # Example entries (automatically managed)
265+ # TRUST_REMOTE_CODE_A1B2C3D4_MODEL=nomic-ai/nomic-embed-text-v1.5
266+ TRUST_REMOTE_CODE_A1B2C3D4=true
267+
268+ # TRUST_REMOTE_CODE_E5F6G7H8_MODEL=sentence-transformers/all-MiniLM-L6-v2
269+ TRUST_REMOTE_CODE_E5F6G7H8=false
270+ ```
271+
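One way entries of this shape could be produced is to derive the key suffix from a hash of the model name. The sketch below assumes the first 8 hex characters of a SHA-256 digest. The README does not say which hash function or length is used, so `trust_env_key` and `format_env_entry` are illustrative names only.

```python
import hashlib


def trust_env_key(model_name: str) -> str:
    """Build an env variable name from a short hash of the model name (assumed scheme)."""
    digest = hashlib.sha256(model_name.encode("utf-8")).hexdigest()
    return f"TRUST_REMOTE_CODE_{digest[:8].upper()}"


def format_env_entry(model_name: str, approved: bool) -> str:
    """Render the comment line and the setting line for one model."""
    key = trust_env_key(model_name)
    value = "true" if approved else "false"
    return f"# {key}_MODEL={model_name}\n{key}={value}\n"
```

Hashing keeps the variable name valid even though model identifiers contain characters such as `/` and `.`, while the commented `_MODEL` line keeps the entry readable for people.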
183272## Advanced Usage
184273
185274### Remote Embedding Server
@@ -257,23 +346,33 @@ The system automatically detects and works with databases created by older versi
257346## Dependencies
258347
259348- **chromadb**: Vector database for embeddings
260- - **sentence-transformers**: Local embedding generation
349+ - **sentence-transformers**: Local embedding generation (optional, only needed for local embeddings)
261350- **ollama**: LLM client for local inference
262351- **rich**: Enhanced terminal output and markdown rendering
263352- **flask**: Web server for embedding API
264353- **python-dotenv**: Environment configuration management
354+ - **tiktoken**: Token counting utilities
355+ - **einops**: Tensor operations for advanced models
356+ - **requests**: HTTP client for remote services
265357
266358## File Structure
267359
268360```
269361├── index.py # Unified indexing script
270362├── ask.py # Interactive query interface
271363├── embedding_server.py # Remote embedding server
364+ ├── trust_manager.py # Security: trust_remote_code management
272365├── requirements.txt # Python dependencies
273366├── .env_example # Environment configuration template
274367└── chroma_code/ # Default ChromaDB storage (created after indexing)
275368```
276369
277370## License
278371
279- This project is designed for local development and research use. Please ensure compliance with the terms of service for any external models or APIs used.
372+ This project is designed for local development and research use. Please ensure compliance with the terms of service for any external models or APIs used.
373+
374+ ## Contributions
375+
376+ I welcome any assistance on this project, especially around trying new models for better performance and testing against more logs than I have at my disposal!
377+
378+ Please just fork off of dev and then submit a PR.