Besides supporting WebGPU, this project also provides the harness for other kinds of GPU backends that TVM supports (such as CUDA, OpenCL, and Vulkan) and really enables accessible deployment of LLM models.
2. Install all the prerequisites for web deployment:
1. [emscripten](https://emscripten.org). It is an LLVM-based compiler which compiles C/C++ source code to WebAssembly.
   - Follow the [installation instructions](https://emscripten.org/docs/getting_started/downloads.html#installation-instructions-using-the-emsdk-recommended) to install the latest emsdk.
   - Source `emsdk_env.sh` with `source path/to/emsdk_env.sh`, so that `emcc` is reachable from PATH and the `emcc` command works.
3. [`wasm-pack`](https://rustwasm.github.io/wasm-pack/installer/). It helps build Rust-generated WebAssembly, which is used for the tokenizer in our case.
4. Install jekyll by following the [official guides](https://jekyllrb.com/docs/installation/). It is the package we use for the website.
5. Install jekyll-remote-theme with:

   ```shell
   gem install jekyll-remote-theme
   ```
6. Install [Chrome Canary](https://www.google.com/chrome/canary/). It is a developer version of Chrome that enables the use of WebGPU.
We can verify a successful installation by trying out `emcc`, `jekyll`, and `wasm-pack` in the terminal.
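A quick way to run that check is a small shell loop (a generic sketch, not a script shipped with this project) that reports whether each tool is reachable from PATH:

```shell
# Check that each required tool is reachable from PATH
for tool in emcc jekyll wasm-pack; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: NOT FOUND"
  fi
done
```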
3. Import, optimize and build the LLM model:
* Get the model weights
Currently we support LLaMA and Vicuna.
1. Get the original LLaMA weights in the huggingface format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
2. Follow the instructions [here](https://github.com/lm-sys/FastChat#vicuna-weights) to get the Vicuna weights.
3. Create a soft link to the model path under `dist/models`.
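For example, assuming the converted weights live at `/path/to/vicuna-7b-v1` (a placeholder path; substitute wherever your weights actually are), the soft link could be created with:

```shell
mkdir -p dist/models
# /path/to/vicuna-7b-v1 is a placeholder; point it at your actual weights directory
ln -s /path/to/vicuna-7b-v1 dist/models/vicuna-7b-v1
```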
* Optimize and build the model for the WebGPU backend, exporting the executable to disk in the WebAssembly file format.
  ```shell
  python3 build.py --target webgpu
  ```
  By default, `build.py` uses `vicuna-7b-v1` as the model name. You can also specify the model name explicitly:
  ```shell
  python3 build.py --target webgpu --model llama-7b
  ```
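A command-line interface of this shape can be sketched with `argparse`. This is a hypothetical illustration only, not the actual `build.py`; the flag names and defaults are taken from the commands above, and any further flags of the real script are not shown:

```python
import argparse

def parse_args(argv=None):
    """Parse the two flags shown above; defaults follow the README."""
    parser = argparse.ArgumentParser(description="Build an LLM for a TVM GPU target")
    parser.add_argument("--target", default="webgpu", help="backend target, e.g. webgpu")
    parser.add_argument("--model", default="vicuna-7b-v1", help="model name under dist/models")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"Building {args.model} for target {args.target}")
```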
  Note: `build.py` can be run on macOS with 32GB of memory, or on other operating systems with at least 50GB of CPU memory. We are currently optimizing memory usage to enable more people to try it out locally.
4. Deploy the model on the web with the WebGPU runtime.
  Prepare all the necessary dependencies for the web build:
  ```shell
  ./scripts/prep_deps.sh
  ```
  The last thing to do is to set up the site with:
  ```shell
  ./scripts/local_deploy_site.sh
  ```
  With the site set up, you can go to `localhost:8888/web-llm/` in Chrome Canary to try out the demo on your local machine. Remember: you will need 6.4GB of GPU memory to run the demo. Don't forget to use