@@ -164,11 +164,14 @@ Check models below.
## Download Model Files
- Download `FP16` quantized .gguf files from:
+ Download `FP16` quantized `Llama-3` .gguf files from:
- https://huggingface.co/beehive-lab/Llama-3.2-1B-Instruct-GGUF-FP16
- https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16
- https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16
+ Download `FP16` quantized `Mistral` .gguf files from:
+ - https://huggingface.co/collections/beehive-lab/mistral-gpullama3java-684afabb206136d2e9cd47e0
+
Please be gentle with [huggingface.co](https://huggingface.co) servers:
**Note**: FP16 models are first-class citizens in the current version.
@@ -181,6 +184,9 @@ wget https://huggingface.co/beehive-lab/Llama-3.2-3B-Instruct-GGUF-FP16/resolve/
# Llama 3 (8B) - FP16
wget https://huggingface.co/beehive-lab/Llama-3.2-8B-Instruct-GGUF-FP16/resolve/main/beehive-llama-3.2-8b-instruct-fp16.gguf
+
+ # Mistral (7B) - FP16
+ wget https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.fp16.gguf
```
**[Experimental]** You can download the Q8 and Q4 models used in the original implementation of Llama3.java, but for now they are dequantized to FP16 for TornadoVM support:
@@ -201,7 +207,7 @@ curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/
## Running `llama-tornado`
- To execute Llama3 models with TornadoVM on GPUs use the `llama-tornado` script with the `--gpu` flag.
+ To execute Llama3 or Mistral models with TornadoVM on GPUs, use the `llama-tornado` script with the `--gpu` flag.
### Usage Examples
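For instance, a minimal single-prompt run with one of the models downloaded above might look like the following. This is a sketch: the `--model` and `--prompt` flag names are assumed from the configuration listing further down, and the file names are whichever .gguf files you downloaded.

```bash
# Run the Llama 3.2 1B FP16 model on the GPU with a one-shot prompt
./llama-tornado --gpu --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "Tell me a joke"

# The same invocation works for the Mistral FP16 model downloaded above
./llama-tornado --gpu --model Mistral-7B-Instruct-v0.3.fp16.gguf --prompt "Tell me a joke"
```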
@@ -246,11 +252,11 @@ First, check your GPU specifications. If your GPU has high memory capacity, you
### GPU Memory Requirements by Model Size
- | Model Size | Recommended GPU Memory |
- | ------------| ------------------------|
- | 1B models | 7GB (default) |
- | 3B models | 15GB |
- | 8B models | 20GB+ |
+ | Model Size | Recommended GPU Memory |
+ | ------------- | ------------------------|
+ | 1B models | 7GB (default) |
+ | 3-7B models | 15GB |
+ | 8B models | 20GB+ |

**Note**: If you still encounter memory issues, try:
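For example, if your GPU has headroom beyond the defaults, you can raise the allocation explicitly for a larger model. A sketch, assuming a `--gpu-memory` flag (the flag name is an assumption not visible in this diff hunk; check `llama-tornado --help` for the exact spelling):

```bash
# Give an 8B model the 20GB+ allocation recommended in the table above
./llama-tornado --gpu --gpu-memory 20GB \
  --model beehive-llama-3.2-8b-instruct-fp16.gguf \
  --prompt "Explain TornadoVM in one sentence"
```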
@@ -288,6 +294,7 @@ LLaMA Configuration:
Maximum number of tokens to generate (default: 512)
--stream STREAM Enable streaming output (default: True)
--echo ECHO Echo the input prompt (default: False)
+ --suffix SUFFIX Suffix for fill-in-the-middle request (Codestral) (default: None)

Mode Selection:
-i, --interactive Run in interactive/chat mode (default: False)
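The newly added `--suffix` option pairs with `--prompt` for fill-in-the-middle requests against Codestral-style models: the model generates the code that belongs between the two strings. A hedged sketch (the model file name here is hypothetical):

```bash
# Fill-in-the-middle: generate the body between the prompt and the suffix
./llama-tornado --gpu --model codestral-22b-instruct-fp16.gguf \
  --prompt "public static int fib(int n) {" \
  --suffix "}"
```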