
Commit 4903dc3

alabulei1 authored and juntao committed
Update llm_inference.md
Signed-off-by: alabulei1 <[email protected]>
Signed-off-by: Michael Yuan <[email protected]>
1 parent e3811ce commit 4903dc3

1 file changed: 85 additions, 50 deletions

docs/develop/rust/wasinn/llm_inference.md

Lines changed: 85 additions & 50 deletions
@@ -29,19 +29,33 @@ git clone curl -LO https://huggingface.co/wasmedge/llama2/blob/main/llama-2-7b-c
 Run the inference application in WasmEdge.

 ```bash
-wasmedge --dir .:. \
-  --nn-preload default:GGML:CPU:llama-2-7b.Q5_K_M.gguf llama-chat.wasm default \
-  --prompt 'Robert Oppenheimer most important achievement is ' \
-  --ctx-size 4096
+wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
+  llama-chat.wasm --prompt-template llama-2-chat
 ```

-After executing the command, you may need to wait a moment for the input prompt to appear. Once the execution is complete, the following output will be generated.
+After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:

 ```bash
-Robert Oppenheimer most important achievement is
-1945 Manhattan Project.
-Robert Oppenheimer was born in New York City on April 22, 1904. He was the son of Julius Oppenheimer, a wealthy German-Jewish textile merchant, and Ella Friedman Oppenheimer.
-Robert Oppenheimer was a brilliant student. He attended the Ethical Culture School in New York City and graduated from the Ethical Culture Fieldston School in 1921. He then attended Harvard University, where he received his bachelor's degree.
+[USER]:
+I have two apples, each costing 5 dollars. What is the total cost of these apples?
+*** [prompt begin] ***
+<s>[INST] <<SYS>>
+You are a helpful, respectful and honest assistant. Always answer as short as possible, while being safe. <</SYS>>
+
+I have two apples, each costing 5 dollars. What is the total cost of these apples? [/INST]
+*** [prompt end] ***
+[ASSISTANT]:
+The total cost of the two apples is 10 dollars.
+[USER]:
+How about four apples?
+*** [prompt begin] ***
+<s>[INST] <<SYS>>
+You are a helpful, respectful and honest assistant. Always answer as short as possible, while being safe. <</SYS>>
+
+I have two apples, each costing 5 dollars. What is the total cost of these apples? [/INST] The total cost of the two apples is 10 dollars. </s><s>[INST] How about four apples? [/INST]
+*** [prompt end] ***
+[ASSISTANT]:
+The total cost of four apples is 20 dollars.
 ```

 ## Build and run
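Editor's note: the new `--nn-preload` argument follows the `alias:backend:device:model_file` pattern, with `AUTO` letting WasmEdge pick CPU or GPU at runtime. As a hedged sketch (the flag value is illustrative and not part of this commit), the same command with an explicit context size would look like this:

```bash
# Same command as in the hunk above, with the context size set explicitly via
# -c/--ctx-size (4096 is also the documented default in the CLI options below).
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm --prompt-template llama-2-chat --ctx-size 4096
```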
@@ -70,57 +84,78 @@ curl -LO https://huggingface.co/wasmedge/llama2/blob/main/llama-2-13b-q5_k_m.ggu
 Next, use WasmEdge to load the llama-2-13b model and then ask the model questions.

 ```bash
-wasmedge --dir .:. \
-  --nn-preload default:GGML:CPU:llama-2-13b.Q5_K_M.gguf llama-chat.wasm default \
-  --prompt 'Robert Oppenheimer most important achievement is ' \
-  --ctx-size 4096
+wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-13b-chat-q5_k_m.gguf \
+  llama-chat.wasm --prompt-template llama-2-chat
 ```

 After executing the command, you may need to wait a moment for the input prompt to appear. You can enter your question once you see the `[USER]:` prompt:

 ```bash
-Robert Oppenheimer most important achievement is
-1945 Manhattan Project.
-Robert Oppenheimer was born in New York City on April 22, 1904. He was the son of Julius Oppenheimer, a wealthy German-Jewish textile merchant, and Ella Friedman Oppenheimer.
-Robert Oppenheimer was a brilliant student. He attended the Ethical Culture School in New York City and graduated from the Ethical Culture Fieldston School in 1921. He then attended Harvard University, where he received his bachelor's degree
+[USER]:
+Who is Robert Oppenheimer?
+*** [prompt begin] ***
+<s>[INST] <<SYS>>
+You are a helpful, respectful and honest assistant. Always answer as short as possible, while being safe. <</SYS>>
+
+Who is Robert Oppenheimer? [/INST]
+*** [prompt end] ***
+[ASSISTANT]:
+Robert Oppenheimer was an American theoretical physicist and director of the Manhattan Project, which developed the atomic bomb during World War II. He is widely regarded as one of the most important physicists of the 20th century and is known for his contributions to the development of quantum mechanics and the theory of the atomic nucleus. Oppenheimer was also a prominent figure in the post-war nuclear weapons debate and was a strong advocate for international cooperation on nuclear weapons control.
 ```

-## Optional: Configure the model
-
-You can use environment variables to configure the model execution.
+## Optional: run the model with different CLI options

-| Option          | Default | Function                                                                   |
-| --------------- | ------- | -------------------------------------------------------------------------- |
-| LLAMA_LOG       | 0       | The backend will print diagnostic information when this value is set to 1 |
-| LLAMA_N_CTX     | 512     | The context length is the max number of tokens in the entire conversation |
-| LLAMA_N_PREDICT | 512     | The number of tokens to generate in each response from the model          |
-
-For example, the following command specifies a context length of 4k tokens, which is standard for llama2, and the max number of tokens in each response to be 1k. It also tells WasmEdge to print out logs and statistics of the model at runtime.
+The CLI also supports the following options:

+```bash
+  -m, --model-alias <ALIAS>
+      Model alias [default: default]
+  -c, --ctx-size <CTX_SIZE>
+      Size of the prompt context [default: 4096]
+  -n, --n-predict <N_PREDICT>
+      Number of tokens to predict [default: 1024]
+  -g, --n-gpu-layers <N_GPU_LAYERS>
+      Number of layers to run on the GPU [default: 100]
+  -b, --batch-size <BATCH_SIZE>
+      Batch size for prompt processing [default: 4096]
+  -r, --reverse-prompt <REVERSE_PROMPT>
+      Halt generation at PROMPT, return control.
+  -s, --system-prompt <SYSTEM_PROMPT>
+      System prompt message string [default: "[Default system message for the prompt template]"]
+  -p, --prompt-template <TEMPLATE>
+      Prompt template [default: llama-2-chat] [possible values: llama-2-chat, codellama-instruct, mistral-instruct-v0.1, mistrallite, openchat, belle-llama-2-chat, vicuna-chat, chatml]
+      --log-prompts
+      Print prompt strings to stdout
+      --log-stat
+      Print statistics to stdout
+      --log-enable
+      Print all log information to stdout
+      --stream-stdout
+      Print the output to stdout in a streaming way
+  -h, --help
+      Print help
 ```
-LLAMA_LOG=1 LLAMA_N_CTX=4096 LLAMA_N_PREDICT=128 wasmedge --dir .:. \
-  --nn-preload default:GGML:CPU:llama-2-7b.Q5_K_M.gguf llama-simple.wasm default \
-  --prompt 'Robert Oppenheimer most important achievement is ' \
-  --ctx-size 4096

-...................................................................................................
-[2023-10-08 23:13:10.272] [info] [WASI-NN] GGML backend: set n_ctx to 4096
-llama_new_context_with_model: kv self size = 2048.00 MB
-llama_new_context_with_model: compute buffer total size = 297.47 MB
-llama_new_context_with_model: max tensor size = 102.54 MB
-[2023-10-08 23:13:10.472] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
-[2023-10-08 23:13:10.472] [info] [WASI-NN] GGML backend: set n_predict to 128
-[2023-10-08 23:13:16.014] [info] [WASI-NN] GGML backend: llama_get_kv_cache_token_count 128
+For example, the following command tells WasmEdge to print out logs and statistics of the model at runtime.

-llama_print_timings:        load time = 1431.58 ms
-llama_print_timings:      sample time = 3.53 ms / 118 runs ( 0.03 ms per token, 33446.71 tokens per second)
-llama_print_timings: prompt eval time = 1230.69 ms / 11 tokens ( 111.88 ms per token, 8.94 tokens per second)
-llama_print_timings:        eval time = 4295.81 ms / 117 runs ( 36.72 ms per token, 27.24 tokens per second)
-llama_print_timings:       total time = 5742.71 ms
-Robert Oppenheimer most important achievement is
-1945 Manhattan Project.
-Robert Oppenheimer was born in New York City on April 22, 1904. He was the son of Julius Oppenheimer, a wealthy German-Jewish textile merchant, and Ella Friedman Oppenheimer.
-Robert Oppenheimer was a brilliant student. He attended the Ethical Culture School in New York City and graduated from the Ethical Culture Fieldston School in 1921. He then attended Harvard University, where he received his bachelor's degree.
+```
+wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
+  llama-chat.wasm --prompt-template llama-2-chat --log-enable
+..................................................................................................
+llama_new_context_with_model: n_ctx      = 512
+llama_new_context_with_model: freq_base  = 10000.0
+llama_new_context_with_model: freq_scale = 1
+llama_new_context_with_model: kv self size = 256.00 MB
+llama_new_context_with_model: compute buffer total size = 76.63 MB
+[2023-11-07 02:07:44.019] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
+
+llama_print_timings:        load time = 11523.19 ms
+llama_print_timings:      sample time = 2.62 ms / 102 runs ( 0.03 ms per token, 38961.04 tokens per second)
+llama_print_timings: prompt eval time = 11479.27 ms / 49 tokens ( 234.27 ms per token, 4.27 tokens per second)
+llama_print_timings:        eval time = 13571.37 ms / 101 runs ( 134.37 ms per token, 7.44 tokens per second)
+llama_print_timings:       total time = 25104.57 ms
+[ASSISTANT]:
+Ah, a fellow Peanuts enthusiast! Snoopy is Charlie Brown's lovable and imaginative beagle, known for his wild and wacky adventures in the comic strip and television specials. He's a loyal companion to Charlie Brown and the rest of the Peanuts gang, and his antics often provide comic relief in the series. Is there anything else you'd like to know about Snoopy? 🐶
 ```

 ## Improve performance
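Editor's note: the options documented above can be combined. A hedged example (flag values are illustrative, not taken from this commit) that limits generation length and streams the response:

```bash
# Cap the prompt context at 1024 tokens (-c) and each response at 512 tokens (-n),
# stream tokens to stdout as they are generated, and print timing statistics.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm --prompt-template llama-2-chat \
  -c 1024 -n 512 --stream-stdout --log-stat
```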
@@ -216,7 +251,7 @@ Next, The prompt is converted into bytes and set as the input tensor for the mod
     .expect("Failed to set prompt as the input tensor");
 ```

-Next, excute the model inference.
+Next, execute the model inference.

 ```rust
 // execute the inference
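Editor's note: for readers following the Rust walkthrough, here is a hedged sketch of how this inference step typically continues in the wasi-nn bindings used by llama-utils. The buffer size and variable names are illustrative assumptions, not taken from this commit.

```rust
// Sketch, assuming `context` is the wasi-nn GraphExecutionContext created
// earlier in the walkthrough. Run the inference on the loaded prompt.
context.compute().expect("Failed to execute the inference");

// Read output tensor 0 back into a byte buffer and decode it as UTF-8.
// The 1024-byte buffer size is an illustrative assumption.
let mut output_buffer = vec![0u8; 1024];
let output_size = context
    .get_output(0, &mut output_buffer)
    .expect("Failed to get the output tensor");
let output = String::from_utf8_lossy(&output_buffer[..output_size]).to_string();
```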
@@ -241,7 +276,7 @@ println!("\nprompt: {}", &prompt);
 println!("\noutput: {}", output);
 ```

-The code explanation above is simple one time chat with llama 2 model. But we have more!
+The code explained above is a simple [one-time chat with the llama 2 model](https://github.com/second-state/llama-utils/tree/main/simple). But we have more!

 * If you're looking for continuous conversations with llama 2 models, please check out the source code [here](https://github.com/second-state/llama-utils/tree/main/chat).
 * If you want to construct OpenAI-compatible APIs specifically for your llama2 model, or the Llama2 model itself, please check out the source code [here](https://github.com/second-state/llama-utils/tree/main/api-server).
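Editor's note: since the api-server linked above is described as OpenAI-compatible, a request against it would plausibly follow the OpenAI chat-completions shape. A hedged sketch only; the host, port, and endpoint path are assumptions, not taken from this commit.

```bash
# Hypothetical request against a locally running OpenAI-compatible api-server.
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Who is Robert Oppenheimer?"}]}'
```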
