
Commit 08fb3ee

feat: added ollama official support
1 parent f36186c commit 08fb3ee

File tree

5 files changed: +265 -230 lines changed

agentic_rag/README.md

Lines changed: 26 additions & 10 deletions
@@ -14,7 +14,7 @@ The system has the following features:
 - Smart context retrieval and response generation
 - FastAPI-based REST API for document upload and querying
 - Support for both OpenAI-based agents or local, transformer-based agents (`Mistral-7B` by default)
-- Support for quantized models (4-bit/8-bit) and GGUF models for faster inference
+- Support for quantized models (4-bit/8-bit) and Ollama models for faster inference
 - Optional Chain of Thought (CoT) reasoning for more detailed and structured responses
 
 <img src="img/gradio_1.png" alt="Gradio Interface" width="80%">
@@ -43,7 +43,7 @@ Here you can find a result of using Chain of Thought (CoT) reasoning:
 - GPU with 8GB VRAM recommended for better performance
 - Will run on CPU if GPU is not available, but will be significantly slower.
 - For quantized models (4-bit/8-bit): Reduced VRAM requirements (4-6GB) with minimal performance impact
-- For GGUF models: Further reduced memory requirements, with models automatically optimizing GPU usage based on available VRAM
+- For Ollama models: Requires Ollama to be installed and running, with significantly reduced memory requirements
 
 ### Setup
 
@@ -55,7 +55,7 @@ Here you can find a result of using Chain of Thought (CoT) reasoning:
 pip install -r requirements.txt
 ```
 
-2. Authenticate with HuggingFace:
+2. Authenticate with HuggingFace (for Hugging Face models only):
 
 The system uses `Mistral-7B` by default, which requires authentication with HuggingFace:
 
@@ -78,14 +78,29 @@ Here you can find a result of using Chain of Thought (CoT) reasoning:
 
 If no API key is provided, the system will automatically download and use `Mistral-7B-Instruct-v0.2` for text generation when using the local model. No additional configuration is needed.
 
-2. For quantized models and GGUF support, ensure these packages are installed:
+4. For quantized models, ensure bitsandbytes is installed:
 
 ```bash
-sudo apt install build-essential
-pip install bitsandbytes>=0.41.0 llama-cpp-python>=0.2.38 huggingface_hub
+pip install bitsandbytes>=0.41.0
 ```
 
-3. To run `llama-cpp` models on Windows, you will need to install Visual C++ Build Tools, and install the C++ development-related components throughout the installation procedure. Choosing the "C++ build tools" and any necessary libraries or packages is usually part of this.
+5. For Ollama models, install Ollama:
+
+a. Download and install Ollama from [ollama.com/download](https://ollama.com/download) for Windows, or run the following command in Linux:
+
+```bash
+curl -fsSL https://ollama.com/install.sh | sh
+```
+
+b. Start the Ollama service
+
+c. Pull the models you want to use beforehand:
+
+```bash
+ollama pull llama3
+ollama pull phi3
+ollama pull qwen2
+```
 
 ## 1. Getting Started
 
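After following the Ollama setup steps added above, a quick way to confirm the installation is to exercise the service from Python. This is a minimal sketch, assuming the official `ollama` Python client (`pip install ollama`), a running Ollama daemon on the default port, and that `llama3` has already been pulled; the exact response object shape can vary between client versions.

```python
# Minimal sanity check for the Ollama setup described above.
# Assumptions: `ollama` Python client installed, daemon running, llama3 already pulled.
import ollama

# Raises a connection error if the Ollama daemon is not running.
models = ollama.list()
print("Ollama is reachable; installed models:", models)

# Send one test prompt to llama3; the dict-style access shown here may
# differ slightly across client versions.
reply = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Reply with a single short sentence."}],
)
print(reply["message"]["content"])
```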
@@ -129,16 +144,17 @@ This will start the Gradio server and automatically open the interface in your d
 
 3. **Chat Interface**:
 - Select between different model options:
-- Local (Mistral) - Default Mistral-7B model
+- Local (Mistral) - Default Mistral-7B model (recommended)
 - Local (Mistral) with 4-bit or 8-bit quantization for faster inference
-- GGUF models (Phi-4-mini, Qwen_QwQ-32B, TinyR1-32B) for optimized performance
+- Ollama models (llama3, phi-3, qwen2) as alternative options
 - OpenAI (if API key is configured)
 - Toggle Chain of Thought reasoning for more detailed responses
 - Chat with your documents using natural language
 - Clear chat history as needed
 
 Note: The interface will automatically detect available models based on your configuration:
-- Local Mistral model requires HuggingFace token in `config.yaml`
+- Local Mistral model requires HuggingFace token in `config.yaml` (default option)
+- Ollama models require Ollama to be installed and running (alternative options)
 - OpenAI model requires API key in `.env` file
 
 ### 3. Using Individual Python Components via Command Line

agentic_rag/gradio_app.py

Lines changed: 119 additions & 89 deletions
@@ -101,32 +101,46 @@ def chat(message: str, history: List[List[str]], agent_type: str, use_cot: bool,
     elif "8-bit" in agent_type:
         quantization = "8bit"
         model_type = "Local (Mistral)"
-    elif "GGUF" in agent_type:
-        model_type = "GGUF"
+    elif "Ollama" in agent_type:
+        model_type = "Ollama"
         # Extract model name from agent_type
-        if "Phi-4-mini" in agent_type:
-            model_name = "unsloth/Phi-4-mini-instruct-GGUF"
-        elif "Qwen_QwQ-32B" in agent_type:
-            model_name = "bartowski/Qwen_QwQ-32B-GGUF"
-        elif "TinyR1-32B" in agent_type:
-            model_name = "bartowski/qihoo360_TinyR1-32B-Preview-GGUF"
+        if "llama3" in agent_type.lower():
+            model_name = "ollama:llama3"
+        elif "phi-3" in agent_type.lower():
+            model_name = "ollama:phi3"
+        elif "qwen2" in agent_type.lower():
+            model_name = "ollama:qwen2"
     else:
         model_type = agent_type
 
     # Select appropriate agent and reinitialize with correct settings
-    if "Local" in model_type or model_type == "GGUF":
+    if "Local" in model_type:
+        # For HF models, we need the token
         if not hf_token:
             response_text = "Local agent not available. Please check your HuggingFace token configuration."
             print(f"Error: {response_text}")
             return history + [[message, response_text]]
-
-        # Use specified model_name for GGUF models, otherwise use default
-        if model_type == "GGUF" and model_name:
-            agent = LocalRAGAgent(vector_store, model_name=model_name, use_cot=use_cot,
-                                  collection=collection, skip_analysis=skip_analysis)
+        agent = LocalRAGAgent(vector_store, use_cot=use_cot, collection=collection,
+                              skip_analysis=skip_analysis, quantization=quantization)
+    elif model_type == "Ollama":
+        # For Ollama models
+        if model_name:
+            try:
+                agent = LocalRAGAgent(vector_store, model_name=model_name, use_cot=use_cot,
+                                      collection=collection, skip_analysis=skip_analysis)
+            except Exception as e:
+                response_text = f"Error initializing Ollama model: {str(e)}. Falling back to Local Mistral."
+                print(f"Error: {response_text}")
+                # Fall back to Mistral if Ollama fails
+                if hf_token:
+                    agent = LocalRAGAgent(vector_store, use_cot=use_cot, collection=collection,
+                                          skip_analysis=skip_analysis)
+                else:
+                    return history + [[message, "Local Mistral agent not available for fallback. Please check your HuggingFace token configuration."]]
         else:
-            agent = LocalRAGAgent(vector_store, use_cot=use_cot, collection=collection,
-                                  skip_analysis=skip_analysis, quantization=quantization)
+            response_text = "Ollama model not specified correctly."
+            print(f"Error: {response_text}")
+            return history + [[message, response_text]]
     else:
         if not openai_key:
             response_text = "OpenAI agent not available. Please check your OpenAI API key configuration."
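For readers skimming the hunk above, the new selection-and-fallback behaviour can be condensed into a standalone helper. This is only an illustrative sketch: `build_agent` is a hypothetical name, and `LocalRAGAgent`'s keyword arguments are assumed from the calls shown in this diff.

```python
# Illustrative sketch of the agent-selection logic above; `build_agent` is a
# hypothetical helper, and LocalRAGAgent's signature is assumed from the diff.
from local_rag_agent import LocalRAGAgent

def build_agent(vector_store, model_type, model_name=None, quantization=None,
                use_cot=False, collection=None, skip_analysis=False, hf_token=None):
    common = dict(use_cot=use_cot, collection=collection, skip_analysis=skip_analysis)
    if "Local" in model_type:
        # Hugging Face Mistral path: requires a token, supports 4-/8-bit quantization.
        if not hf_token:
            raise RuntimeError("HuggingFace token required for the local Mistral agent")
        return LocalRAGAgent(vector_store, quantization=quantization, **common)
    if model_type == "Ollama" and model_name:
        # Ollama path: model_name looks like "ollama:llama3".
        try:
            return LocalRAGAgent(vector_store, model_name=model_name, **common)
        except Exception:
            # Fall back to the default Mistral agent when Ollama is unavailable.
            if hf_token:
                return LocalRAGAgent(vector_store, **common)
            raise
    raise ValueError(f"Unsupported model selection: {model_type!r}")
```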
@@ -217,15 +231,19 @@ def create_interface():
 
     # Create model choices list for reuse
     model_choices = []
+    # HF models first if token is available
     if hf_token:
         model_choices.extend([
             "Local (Mistral)",
             "Local (Mistral) - 4-bit Quantized",
             "Local (Mistral) - 8-bit Quantized",
-            "GGUF - Phi-4-mini-instruct",
-            "GGUF - Qwen_QwQ-32B",
-            "GGUF - TinyR1-32B-Preview"
         ])
+    # Then Ollama models (don't require HF token)
+    model_choices.extend([
+        "Ollama - llama3",
+        "Ollama - phi-3",
+        "Ollama - qwen2"
+    ])
     if openai_key:
         model_choices.append("OpenAI")
 
@@ -235,10 +253,16 @@ def create_interface():
     ## Model Management
 
     Download models in advance to prepare them for use in the chat interface.
-    This can help avoid delays when first using a model and ensure all models are properly downloaded.
 
-    > **Note**: Some models may require accepting terms on the Hugging Face website before downloading.
-    > If you encounter a 401 error, please follow the link provided to accept the model terms.
+    ### Hugging Face Models (Default)
+
+    The system uses Mistral-7B by default. For Hugging Face models (Mistral), you'll need a Hugging Face token in your config.yaml file.
+
+    ### Ollama Models (Alternative)
+
+    Ollama models are available as alternatives. For Ollama models, this will pull the model using the Ollama client.
+    Make sure Ollama is installed and running on your system.
+    You can download Ollama from [ollama.com/download](https://ollama.com/download)
     """)
 
     with gr.Row():
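The `with gr.Row():` context line above presumably introduces the download controls for this tab. As a hedged illustration of how such controls could be wired to `download_model` with stock Gradio components (widget names, layout, and the stub function below are illustrative, not taken from the commit):

```python
# Hedged sketch: wiring a model dropdown and button to a download_model-style
# function. The stub and widget names are illustrative only.
import gradio as gr

def download_model(model_type: str) -> str:  # stub standing in for the function in gradio_app.py
    return f"✅ (stub) would download: {model_type}"

model_choices = ["Local (Mistral)", "Ollama - llama3", "Ollama - phi-3", "Ollama - qwen2"]

with gr.Blocks() as demo:
    gr.Markdown("## Model Management")
    with gr.Row():
        model_dropdown = gr.Dropdown(choices=model_choices, label="Model to download")
        download_btn = gr.Button("Download / Pull Model")
    status_box = gr.Textbox(label="Download status", interactive=False)
    # Clicking the button runs the function and displays its ✅/❌ status string.
    download_btn.click(fn=download_model, inputs=model_dropdown, outputs=status_box)

# demo.launch()  # uncomment to try the stub UI locally
```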
@@ -275,20 +299,20 @@ def create_interface():
     - VRAM Required: ~6GB
     - Balance between quality and memory usage
 
-    **GGUF - Phi-4-mini-instruct**: Microsoft's Phi-4-mini model in GGUF format.
-    - Size: ~2-4GB
-    - VRAM Required: Scales based on available VRAM
-    - Efficient small model with good performance
+    **Ollama - llama3**: Meta's Llama 3 model via Ollama.
+    - Size: ~4GB
+    - Requires Ollama to be installed and running
+    - Excellent performance and quality
 
-    **GGUF - Qwen_QwQ-32B**: Qwen 32B model in GGUF format.
-    - Size: ~20GB
-    - VRAM Required: Scales based on available VRAM
-    - High-quality large model
+    **Ollama - phi-3**: Microsoft's Phi-3 model via Ollama.
+    - Size: ~4GB
+    - Requires Ollama to be installed and running
+    - Efficient small model with good performance
 
-    **GGUF - TinyR1-32B-Preview**: Qihoo 360's TinyR1 32B model in GGUF format.
-    - Size: ~20GB
-    - VRAM Required: Scales based on available VRAM
-    - High-quality large model
+    **Ollama - qwen2**: Alibaba's Qwen2 model via Ollama.
+    - Size: ~4GB
+    - Requires Ollama to be installed and running
+    - High-quality model with good performance
     """)
 
     # Document Processing Tab
@@ -480,64 +504,22 @@ def download_model(model_type: str) -> str:
 
         # Parse model type to determine model and quantization
         quantization = None
-        model_name = "mistralai/Mistral-7B-Instruct-v0.2" # Default model
-
-        if "4-bit" in model_type:
-            quantization = "4bit"
-        elif "8-bit" in model_type:
-            quantization = "8bit"
-        elif "GGUF" in model_type:
-            # Extract model name from model_type
-            if "Phi-4-mini" in model_type:
-                model_name = "unsloth/Phi-4-mini-instruct-GGUF"
-            elif "Qwen_QwQ-32B" in model_type:
-                model_name = "bartowski/Qwen_QwQ-32B-GGUF"
-            elif "TinyR1-32B" in model_type:
-                model_name = "bartowski/qihoo360_TinyR1-32B-Preview-GGUF"
-
-        # Check if HuggingFace token is available
-        if not hf_token:
-            return "❌ Error: HuggingFace token not found in config.yaml. Please add your token first."
-
-        # Start download timer
-        start_time = time.time()
+        model_name = None
 
-        # For GGUF models, use the GGUFModelHandler to download
-        if "GGUF" in model_type:
-            try:
-                from local_rag_agent import GGUFModelHandler
-                from huggingface_hub import list_repo_files
-
-                # Extract repo_id
-                parts = model_name.split('/')
-                repo_id = '/'.join(parts[:2])
+        if "4-bit" in model_type or "8-bit" in model_type:
+            # For HF models, we need the token
+            if not hf_token:
+                return "❌ Error: HuggingFace token not found in config.yaml. Please add your token first."
+
+            model_name = "mistralai/Mistral-7B-Instruct-v0.2" # Default model
+            if "4-bit" in model_type:
+                quantization = "4bit"
+            elif "8-bit" in model_type:
+                quantization = "8bit"
 
-                # Check if model is gated
-                try:
-                    files = list_repo_files(repo_id, token=hf_token)
-                    gguf_files = [f for f in files if f.endswith('.gguf')]
-
-                    if not gguf_files:
-                        return f"❌ Error: No GGUF files found in repo: {repo_id}"
-
-                    # Download the model
-                    handler = GGUFModelHandler(model_name)
-
-                    # Calculate download time
-                    download_time = time.time() - start_time
-                    return f"✅ Successfully downloaded {model_type} in {download_time:.1f} seconds."
-
-                except Exception as e:
-                    if "401" in str(e):
-                        return f"❌ Error: This model is gated. Please accept the terms on the Hugging Face website: https://huggingface.co/{repo_id}"
-                    else:
-                        return f"❌ Error downloading model: {str(e)}"
+            # Start download timer
+            start_time = time.time()
 
-            except ImportError:
-                return "❌ Error: llama-cpp-python not installed. Please install with: pip install llama-cpp-python"
-
-        # For Transformers models, use the Transformers library
-        else:
             try:
                 from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
 
@@ -587,6 +569,54 @@ def download_model(model_type: str) -> str:
 
             except Exception as e:
                 return f"❌ Error downloading model: {str(e)}"
+
+        elif "Ollama" in model_type:
+            # Extract model name from model_type
+            if "llama3" in model_type.lower():
+                model_name = "llama3"
+            elif "phi-3" in model_type.lower():
+                model_name = "phi3"
+            elif "qwen2" in model_type.lower():
+                model_name = "qwen2"
+            else:
+                return "❌ Error: Unknown Ollama model type"
+
+            # Use Ollama to pull the model
+            try:
+                import ollama
+
+                print(f"Pulling Ollama model: {model_name}")
+                start_time = time.time()
+
+                # Pull the model with progress tracking
+                progress_text = ""
+                for progress in ollama.pull(model_name, stream=True):
+                    status = progress.get('status')
+                    if status:
+                        progress_text = f"Status: {status}"
+                        print(progress_text)
+
+                    # Show download progress
+                    if 'completed' in progress and 'total' in progress:
+                        completed = progress['completed']
+                        total = progress['total']
+                        if total > 0:
+                            percent = (completed / total) * 100
+                            progress_text = f"Downloading: {percent:.1f}% ({completed}/{total})"
+                            print(progress_text)
+
+                # Calculate download time
+                download_time = time.time() - start_time
+                return f"✅ Successfully pulled Ollama model {model_name} in {download_time:.1f} seconds."
+
+            except ImportError:
+                return "❌ Error: ollama not installed. Please install with: pip install ollama"
+            except ConnectionError:
+                return "❌ Error: Could not connect to Ollama. Please make sure Ollama is installed and running."
+            except Exception as e:
+                return f"❌ Error pulling Ollama model: {str(e)}"
+        else:
+            return "❌ Error: Unknown model type"
 
     except Exception as e:
         return f"❌ Error: {str(e)}"
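Since the new branch above catches `ConnectionError` around `ollama.pull`, a cheap reachability check before pulling can make failures more obvious to the caller. A minimal sketch, assuming the `ollama` Python client; `safe_pull` is an illustrative helper, not part of this commit:

```python
# Minimal sketch: check that the Ollama daemon is reachable before pulling a model.
# Assumes the `ollama` Python client; `safe_pull` is illustrative, not part of the commit.
import ollama

def safe_pull(model_name: str) -> str:
    try:
        ollama.list()  # cheap call that fails fast if the daemon is not running
    except Exception:
        return "❌ Error: Could not connect to Ollama. Start it with `ollama serve` and retry."
    try:
        ollama.pull(model_name)  # blocking pull; pass stream=True for progress events
        return f"✅ Successfully pulled Ollama model {model_name}."
    except Exception as e:
        return f"❌ Error pulling Ollama model: {e}"

print(safe_pull("llama3"))
```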
