@@ -43,7 +43,7 @@ Here you can find a result of using Chain of Thought (CoT) reasoning:
 - GPU with 8GB VRAM recommended for better performance
 - Will run on CPU if GPU is not available, but will be significantly slower.
 - For quantized models (4-bit/8-bit): Reduced VRAM requirements (4-6GB) with minimal performance impact
-- For GGUF models: Further reduced memory requirements, with models automatically optimizing GPU usage based on available VRAM
+- For Ollama models: Requires Ollama to be installed and running, with significantly reduced memory requirements

 ### Setup

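The VRAM figures above assume an NVIDIA GPU. As an optional check that is not part of the diff itself, a quick way to see how much memory is actually free before choosing between the full, quantized, and Ollama variants:

```bash
# Report each NVIDIA GPU's name plus total and free VRAM.
# If this command is unavailable, the system falls back to CPU inference,
# which the requirements above note will be significantly slower.
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```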
@@ -55,7 +55,7 @@ Here you can find a result of using Chain of Thought (CoT) reasoning:
 pip install -r requirements.txt
 ```

-2. Authenticate with HuggingFace:
+2. Authenticate with HuggingFace (for Hugging Face models only):

 The system uses `Mistral-7B` by default, which requires authentication with HuggingFace:

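The visible part of the diff does not show the authentication commands themselves. A minimal sketch of the standard Hugging Face tooling route (the project itself reads the token from `config.yaml`, per the note later in this diff; `hf_xxx` below is a placeholder, not a real token):

```bash
# Install the Hugging Face Hub CLI and log in with a read-access token
# created at https://huggingface.co/settings/tokens
pip install -U "huggingface_hub[cli]"
huggingface-cli login

# Or expose the token to the current shell session instead
export HF_TOKEN=hf_xxx   # placeholder token value
```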
@@ -78,14 +78,29 @@ Here you can find a result of using Chain of Thought (CoT) reasoning:

 If no API key is provided, the system will automatically download and use `Mistral-7B-Instruct-v0.2` for text generation when using the local model. No additional configuration is needed.

-2. For quantized models and GGUF support, ensure these packages are installed:
-3. To run `llama-cpp` models on Windows, you will need to install the Visual C++ Build Tools and select the C++ development components during installation; this usually means selecting the "C++ build tools" option along with any required libraries or packages.
+4. For quantized models, ensure bitsandbytes is installed:
+5. For Ollama models, install Ollama:
+
+   a. Download and install Ollama from [ollama.com/download](https://ollama.com/download) on Windows, or run the following command on Linux:
+
+   ```bash
+   curl -fsSL https://ollama.com/install.sh | sh
+   ```
+
+   b. Start the Ollama service.
+
+   c. Pull the models you want to use beforehand:
+
+   ```bash
+   ollama pull llama3
+   ollama pull phi3
+   ollama pull qwen2
+   ```

 ## 1. Getting Started

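The new step 4 names bitsandbytes, but the corresponding install command falls outside the visible hunk. A minimal sketch, assuming a plain pip setup (`accelerate` is a common companion for quantized loading and is an assumption, not something the diff specifies), plus a quick check that the Ollama service from step 5 is reachable:

```bash
# Quantized (4-bit/8-bit) loading of Mistral-7B relies on bitsandbytes.
pip install bitsandbytes accelerate

# Verify the Ollama service is running and the pulled models are present.
ollama list
curl http://localhost:11434/api/tags
```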
@@ -129,16 +144,17 @@ This will start the Gradio server and automatically open the interface in your d

 3. **Chat Interface**:
    - Select between different model options:
-     - Local (Mistral) - Default Mistral-7B model
+     - Local (Mistral) - Default Mistral-7B model (recommended)
      - Local (Mistral) with 4-bit or 8-bit quantization for faster inference
-     - GGUF models (Phi-4-mini, Qwen_QwQ-32B, TinyR1-32B) for optimized performance
+     - Ollama models (llama3, phi-3, qwen2) as alternative options
      - OpenAI (if API key is configured)
    - Toggle Chain of Thought reasoning for more detailed responses
    - Chat with your documents using natural language
    - Clear chat history as needed

 Note: The interface will automatically detect available models based on your configuration:
-- Local Mistral model requires HuggingFace token in `config.yaml`
+- Local Mistral model requires HuggingFace token in `config.yaml` (default option)
+- Ollama models require Ollama to be installed and running (alternative options)
 - OpenAI model requires API key in `.env` file

 ### 3. Using Individual Python Components via Command Line
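As an optional pre-flight check before launching the interface, each detection condition in the note above can be verified from the shell. The `OPENAI_API_KEY` variable name and the exact key name inside `config.yaml` are assumptions, not taken from the diff:

```bash
# Local Mistral: config.yaml should contain your HuggingFace token
grep -i "token" config.yaml      # exact key name may differ in this project

# Ollama models: the service must be running and the models pulled
ollama list

# OpenAI: an API key is expected in the .env file
grep "OPENAI_API_KEY" .env       # variable name is an assumption
```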