Commit 2d375d4

Merge pull request #8 from barrulus/dev
Major Feature Update: Security Framework, Auto File Detection, and Apple Silicon Support
2 parents 9420a2d + c0e68c9 commit 2d375d4

11 files changed

+611
-102
lines changed

.env.example

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
1+
# Environment Configuration for Vector Code Retrieval System
2+
3+
# Ollama Configuration
4+
OLLAMA_HOST=http://localhost:11434
5+
OLLAMA_MODEL=qwen3:8b
6+
7+
# Embedding Configuration
8+
EMBEDDING_SERVER=http://localhost:5000
9+
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5
10+
11+
# Storage Configuration
12+
CHROMA_PATH=./chroma_code
13+
14+
# Service Selection
15+
USE_LOCAL_EMBEDDINGS=true
16+
USE_LOCAL_OLLAMA=true
17+
18+
# Trust Remote Code Settings (automatically managed by trust_manager.py)
19+
# Format: TRUST_REMOTE_CODE_<MODEL_HASH>=true|false
20+
# These are automatically added when you approve/deny models
21+
# Example:
22+
# # TRUST_REMOTE_CODE_A1B2C3D4_MODEL=nomic-ai/nomic-embed-text-v1.5
23+
# TRUST_REMOTE_CODE_A1B2C3D4=true

.env_example

Lines changed: 0 additions & 14 deletions
This file was deleted.

.gitignore_example

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
.gitignore
2+
venv/
3+
.vscode/
4+
__pycache__/
5+
chroma_code/
6+
chroma_db/
7+
*-queries.md

.vscode/settings.json

Lines changed: 0 additions & 4 deletions
This file was deleted.

README.md

Lines changed: 116 additions & 17 deletions
@@ -9,7 +9,8 @@ A powerful semantic search system for log files that enables natural language qu
99
- **Local LLM Integration**: Generates AI responses using Ollama with customizable models
1010
- **Interactive Query Interface**: Rich terminal interface with markdown rendering
1111
- **GPU Acceleration**: Optional GPU support for faster embedding generation
12-
- **Comprehensive File Support**: Indexes `.py`, `.log`, `.js`, `.ts`, `.md`, `.sql`, `.html`, `.csv` files
12+
- **Automatic File Detection**: Intelligently detects and indexes all text-based files by content analysis
13+
- **Security-First Design**: Client-side trust_remote_code management with consent prompts and persistent tracking
1314
- **Environment Configuration**: Fully configurable via `.env` files
1415

1516
## Quick Start
@@ -63,25 +64,30 @@ USE_LOCAL_EMBEDDINGS=true
6364
USE_LOCAL_OLLAMA=true
6465
```
6566

66-
### 3. Index Your Log Files
67+
### 3. Index Your Files
6768

68-
Index a directory using local embeddings (default):
69+
Index a directory with automatic file detection:
6970

7071
```bash
71-
python index.py /path/to/your/logs
72+
python index.py /path/to/your/files
7273
```
7374

75+
The system will:
76+
- Automatically detect all text-based files by content analysis
77+
- Skip binary files and common build/cache directories
78+
- Prompt for trust_remote_code consent if needed for the embedding model
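Content-based detection of this kind can be sketched as follows. This is a minimal illustration, not the logic in `index.py`: the helper names `looks_like_text` and `iter_text_files`, the `SKIP_DIRS` set, and the 4096-byte sample size are all assumptions made for the example.

```python
import codecs
from pathlib import Path
from typing import Iterator

# Hypothetical skip list; the project's actual list is not shown here.
SKIP_DIRS = {".git", "__pycache__", "node_modules", "venv", "build", "dist"}


def looks_like_text(sample: bytes) -> bool:
    """Heuristic: a sample is text if it has no NUL bytes and is valid UTF-8."""
    if b"\x00" in sample:
        return False
    # An incremental decoder tolerates a multi-byte character cut off at the
    # end of the sample, but still rejects invalid byte sequences.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def iter_text_files(root: Path, sample_size: int = 4096) -> Iterator[Path]:
    """Yield files under root that look like text, skipping SKIP_DIRS."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Only directory components are checked, not the file name itself.
        if SKIP_DIRS.intersection(path.relative_to(root).parts[:-1]):
            continue
        with path.open("rb") as fh:
            sample = fh.read(sample_size)
        if looks_like_text(sample):
            yield path
```

Reading only a small prefix keeps the scan cheap on large repositories, at the cost of misclassifying files whose binary content starts later than the sample.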
79+
7480
Or specify embedding type:
7581

7682
```bash
77-
# Use local SentenceTransformer embeddings
78-
python index.py /path/to/logs --local-embeddings
83+
# Use local SentenceTransformer embeddings (default)
84+
python index.py /path/to/files --local-embeddings
7985

8086
# Use Ollama embeddings
81-
python index.py /path/to/logs --ollama-embeddings
87+
python index.py /path/to/files --ollama-embeddings
8288

8389
# Use remote embedding server
84-
python index.py /path/to/logs --remote-embeddings
90+
python index.py /path/to/files --remote-embeddings
8591
```
8692

8793
Additional options:
@@ -94,14 +100,19 @@ python index.py /path/to/logs --model custom-model --chunk-size 1500
94100
python index.py /path/to/logs --chroma-path ./my_custom_db
95101
```
96102

97-
### 4. Query Your Logs
103+
### 4. Query Your Indexed Content
98104

99105
Start the interactive query interface:
100106

101107
```bash
102108
python ask.py
103109
```
104110

111+
The system will:
112+
- Auto-detect the embedding type used during indexing
113+
- Apply the same trust_remote_code settings for consistency
114+
- Generate responses using Ollama's local LLM
115+
105116
Or specify a custom output file:
106117

107118
```bash
@@ -113,22 +124,32 @@ python ask.py my_queries.md
113124
### Core Components
114125

115126
1. **Unified Indexer (`index.py`)**
116-
- Processes repositories and creates vector embeddings
127+
- Processes repositories with automatic file detection
117128
- Supports multiple embedding strategies via handler classes
118-
- Chunks code into configurable segments (default: 2000 characters)
129+
- Chunks content into configurable segments (default: 2000 characters)
130+
- Client-side trust_remote_code management
119131
- Stores embeddings in ChromaDB with metadata tracking
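As a rough illustration of the chunking step, fixed-size character chunking with the 2000-character default could look like the sketch below. The function name `chunk_text` is hypothetical, and the real indexer may split on different boundaries or use overlap.

```python
def chunk_text(text: str, chunk_size: int = 2000) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Joining the chunks in order reproduces the original text, so nothing is dropped at chunk boundaries.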
120132

121133
2. **Query Interface (`ask.py`)**
122-
- Interactive CLI for natural language log queries
123-
- Auto-detects embedding type from metadata
134+
- Interactive CLI for natural language queries
135+
- Auto-detects embedding type and trust settings from metadata
124136
- Generates responses using Ollama's local LLM
137+
- Consistent security model with indexing phase
125138
- Saves all Q&A pairs with timestamps
126139

127140
3. **Embedding Server (`embedding_server.py`)**
128141
- Optional remote embedding service with GPU support
142+
- Respects client-side trust_remote_code decisions
129143
- RESTful API with health checks and server info
130-
- Configurable via command-line arguments
131-
- Supports batch processing and model caching
144+
- Dynamic model loading with trust setting caching
145+
- Supports batch processing and multiple model variants
146+
147+
4. **Trust Manager (`trust_manager.py`)**
148+
- Centralized security management for trust_remote_code
149+
- Auto-detection of models requiring remote code execution
150+
- Interactive consent prompts with risk/benefit explanations
151+
- Persistent approval tracking in .env files
152+
- CLI tools for managing trust settings
132153

133154
### Embedding Handlers
134155

@@ -149,6 +170,7 @@ python ask.py my_queries.md
149170
| `CHROMA_PATH` | ChromaDB storage path | `./chroma_code` |
150171
| `USE_LOCAL_EMBEDDINGS` | Default embedding strategy | `true` |
151172
| `USE_LOCAL_OLLAMA` | Use local Ollama instance | `true` |
173+
| `TRUST_REMOTE_CODE_*` | Model-specific trust settings | Auto-managed |
152174

153175
### Command Line Options
154176

@@ -180,6 +202,73 @@ Options:
180202
--debug Enable debug mode
181203
```
182204

205+
## Security: Trust Remote Code Management
206+
207+
The system includes a comprehensive security framework for models that require `trust_remote_code=True`. This client-side security system:
208+
209+
- **Auto-detects** which models likely need remote code execution based on known patterns
210+
- **Prompts for informed consent** with detailed security warnings
211+
- **Persists decisions** in `.env` with model-specific hash tracking
212+
- **Client-side control** - trust decisions made locally, not on remote servers
213+
- **Cross-component consistency** - same security model for indexing, querying, and serving
214+
215+
### How It Works
216+
217+
1. **Detection**: System analyzes model names against known patterns
218+
2. **User Consent**: Interactive prompts with clear risk/benefit explanations
219+
3. **Persistence**: Decisions saved locally with model identification hashes
220+
4. **Communication**: Client sends trust settings to remote embedding servers
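The detection step above can be approximated by matching model names against a pattern list. The sketch below is an assumption about how such a check might work: the `REMOTE_CODE_PATTERNS` tuple and the function name are illustrative, and the patterns actually used by `trust_manager.py` are not shown in this README.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Hypothetical pattern list for illustration only.
REMOTE_CODE_PATTERNS = ("nomic-ai/*",)


def likely_needs_remote_code(
    model_name: str, patterns: Iterable[str] = REMOTE_CODE_PATTERNS
) -> bool:
    """Return True if the model name matches any known pattern."""
    name = model_name.lower()
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

A name-based check is only a hint. It cannot confirm that a model really ships custom code, which is why the consent prompt remains the deciding step.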
221+
222+
### Managing Trust Settings
223+
224+
```bash
225+
# List all approved/denied models
226+
python trust_manager.py --list
227+
228+
# Check if a specific model needs trust_remote_code
229+
python trust_manager.py --check "nomic-ai/nomic-embed-text-v1.5"
230+
```
231+
232+
### Security Flow
233+
234+
When you first use a model requiring remote code execution:
235+
236+
```
237+
==============================================================
238+
SECURITY WARNING: Remote Code Execution
239+
==============================================================
240+
Model: nomic-ai/nomic-embed-text-v1.5
241+
242+
This model may require 'trust_remote_code=True' which allows
243+
the model to execute arbitrary code during loading.
244+
245+
RISKS:
246+
- The model could execute malicious code
247+
- Your system could be compromised
248+
- Data could be stolen or corrupted
249+
250+
BENEFITS:
251+
- Access to newer/specialized models
252+
- Better embedding quality for some models
253+
254+
Your choice will be saved for this model.
255+
==============================================================
256+
Allow remote code execution for this model? [y/N]:
257+
```
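The `[y/N]` suffix in the prompt indicates that denial is the default. A minimal sketch of that behavior, assuming hypothetical helper names `parse_consent` and `ask_consent`, treats anything other than an explicit yes as a refusal:

```python
from typing import Callable


def parse_consent(answer: str) -> bool:
    """Only an explicit 'y' or 'yes' grants consent; empty input means no."""
    return answer.strip().lower() in {"y", "yes"}


def ask_consent(model_name: str, read: Callable[[str], str] = input) -> bool:
    """Show the model name and ask the question; read is injectable for tests."""
    print(f"Model: {model_name}")
    return parse_consent(read("Allow remote code execution for this model? [y/N]: "))
```

Defaulting to "no" means that pressing Enter, or piping in empty input from a script, never enables remote code execution by accident.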
258+
259+
### Trust Settings Storage
260+
261+
Approval decisions are stored in your `.env` file:
262+
263+
```bash
264+
# Example entries (automatically managed)
265+
# TRUST_REMOTE_CODE_A1B2C3D4_MODEL=nomic-ai/nomic-embed-text-v1.5
266+
TRUST_REMOTE_CODE_A1B2C3D4=true
267+
268+
# TRUST_REMOTE_CODE_E5F6G7H8_MODEL=sentence-transformers/all-MiniLM-L6-v2
269+
TRUST_REMOTE_CODE_E5F6G7H8=false
270+
```
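The entry format above can be written and read back with a few lines of Python. This sketch assumes the eight-character suffix is the first eight hex digits of a SHA-256 digest of the model name; the hashing scheme that `trust_manager.py` actually uses is not documented here, and the function names are hypothetical.

```python
import hashlib
from typing import Optional


def trust_key(model_name: str) -> str:
    """Build the variable name (assumed scheme: first 8 hex chars of SHA-256)."""
    digest = hashlib.sha256(model_name.encode("utf-8")).hexdigest()
    return f"TRUST_REMOTE_CODE_{digest[:8].upper()}"


def format_entry(model_name: str, approved: bool) -> str:
    """Render the comment line plus the setting line for one model."""
    key = trust_key(model_name)
    value = "true" if approved else "false"
    return f"# {key}_MODEL={model_name}\n{key}={value}\n"


def read_decision(env_text: str, model_name: str) -> Optional[bool]:
    """Return the stored decision, or None if the model has no entry yet."""
    key = trust_key(model_name)
    for line in env_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().lower() == "true"
    return None
```

Returning `None` for an unknown model lets the caller distinguish "never asked" from "denied", so the prompt is shown only once per model.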
271+
183272
## Advanced Usage
184273

185274
### Remote Embedding Server
@@ -257,23 +346,33 @@ The system automatically detects and works with databases created by older versi
257346
## Dependencies
258347

259348
- **chromadb**: Vector database for embeddings
260-
- **sentence-transformers**: Local embedding generation
349+
- **sentence-transformers**: Local embedding generation (optional, only needed for local embeddings)
261350
- **ollama**: LLM client for local inference
262351
- **rich**: Enhanced terminal output and markdown rendering
263352
- **flask**: Web server for embedding API
264353
- **python-dotenv**: Environment configuration management
354+
- **tiktoken**: Token counting utilities
355+
- **einops**: Tensor operations for advanced models
356+
- **requests**: HTTP client for remote services
265357

266358
## File Structure
267359

268360
```
269361
├── index.py # Unified indexing script
270362
├── ask.py # Interactive query interface
271363
├── embedding_server.py # Remote embedding server
364+
├── trust_manager.py # Security: trust_remote_code management
272365
├── requirements.txt # Python dependencies
273366
├── .env_example # Environment configuration template
274367
└── chroma_code/ # Default ChromaDB storage (created after indexing)
275368
```
276369

277370
## License
278371

279-
This project is designed for local development and research use. Please ensure compliance with the terms of service for any external models or APIs used.
372+
This project is designed for local development and research use. Please ensure compliance with the terms of service for any external models or APIs used.
373+
374+
## Contributions
375+
376+
I welcome any assistance on this project, especially around trying new models for better performance and testing against more logs than I have at my disposal!
377+
378+
Please fork from `dev` and then submit a PR.

__pycache__/ask.cpython-311.pyc

-5.92 KB
Binary file not shown.
-8.04 KB
Binary file not shown.
