Commit 0a9e2e5

Add instructions for local deployment (#55)
1 parent cb76f41 commit 0a9e2e5

1 file changed: +68 −0 lines changed


README.md

@@ -24,6 +24,74 @@ We also do not want to only do it for just one model. Instead, we would like to
Besides supporting WebGPU, this project also provides the harness for other kinds of GPU backends that TVM supports (such as CUDA, OpenCL, and Vulkan) and really enables accessible deployment of LLM models.
## Instructions for local deployment

1. Install TVM Unity.

    ```shell
    pip3 install mlc-ai-nightly -f https://mlc.ai/wheels
    ```

2. Install all the prerequisites for web deployment:
    1. [emscripten](https://emscripten.org). It is an LLVM-based compiler that compiles C/C++ source code to WebAssembly.
        - Follow the [installation instructions](https://emscripten.org/docs/getting_started/downloads.html#installation-instructions-using-the-emsdk-recommended) to install the latest emsdk.
        - Source `emsdk_env.sh` with `source path/to/emsdk_env.sh`, so that `emcc` is reachable from PATH and the command `emcc` works.
    2. [Rust](https://www.rust-lang.org/tools/install).
    3. [`wasm-pack`](https://rustwasm.github.io/wasm-pack/installer/). It helps build Rust-generated WebAssembly, which is used for the tokenizer in our case.
    4. Install jekyll by following the [official guides](https://jekyllrb.com/docs/installation/). It is the package we use for the website.
    5. Install jekyll-remote-theme with the command
        ```shell
        gem install jekyll-remote-theme
        ```
    6. Install [Chrome Canary](https://www.google.com/chrome/canary/). It is a developer version of Chrome that enables the use of WebGPU.

    We can verify a successful installation by trying out `emcc`, `jekyll` and `wasm-pack` in the terminal.
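As a concrete sketch, the emsdk setup and the final verification of the prerequisites above can look like this (assuming emsdk is cloned into the current directory; the exact version numbers printed will vary):

```shell
# Install and activate the latest emsdk, then make emcc available in this shell
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh
cd ..

# Each command printing a version string confirms that tool is on PATH
emcc --version
jekyll --version
wasm-pack --version
```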
3. Import, optimize, and build the LLM model:
    * Get the model weights.

      Currently we support LLaMA and Vicuna.

      1. Get the original LLaMA weights in the huggingface format by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).
      2. Use the instructions [here](https://github.com/lm-sys/FastChat#vicuna-weights) to get the Vicuna weights.
      3. Create a soft link to the model path under dist/models:
         ```shell
         mkdir -p dist/models
         ln -s your_model_path dist/models/model_name

         # For example:
         # ln -s path/to/vicuna-7b-v1 dist/models/vicuna-7b-v1
         ```
    * Optimize and build the model for the WebGPU backend and export the executable to disk in the WebAssembly file format.

      ```shell
      python3 build.py --target webgpu
      ```
      By default, `build.py` takes `vicuna-7b-v1` as the model name. You can also specify the model name explicitly:
      ```shell
      python3 build.py --target webgpu --model llama-7b
      ```
      Note: `build.py` can be run on macOS with 32GB of memory and on other operating systems with at least 50GB of CPU memory. We are currently optimizing the memory usage to enable more people to try it out locally.
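Putting the step above together, the whole model-preparation flow for the default Vicuna model is just two commands plus the build (the weight directory path is a placeholder for wherever you stored the converted Vicuna weights):

```shell
# Link the Vicuna weights into the location build.py expects
mkdir -p dist/models
ln -s path/to/vicuna-7b-v1 dist/models/vicuna-7b-v1

# Build for the WebGPU target; vicuna-7b-v1 is the default model name
python3 build.py --target webgpu
```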
4. Deploy the model on the web with the WebGPU runtime.

   Prepare all the necessary dependencies for the web build:
   ```shell
   ./scripts/prep_deps.sh
   ```

   The last thing to do is to set up the site with
   ```shell
   ./scripts/local_deploy_site.sh
   ```

   With the site set up, you can go to `localhost:8888/web-llm/` in Chrome Canary to try out the demo on your local machine. Remember: you will need 6.4GB of GPU memory to run the demo. Don't forget to use
   ```shell
   /Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary --enable-dawn-features=disable_robustness
   ```
   to launch Chrome Canary, which turns off the robustness check from Chrome.
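On macOS, the launch flag and the demo URL can be combined into a single command (assuming Chrome Canary is in its default install location and the local site is already serving on port 8888):

```shell
# Launch Chrome Canary with the robustness check disabled and open the demo directly
/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary \
  --enable-dawn-features=disable_robustness \
  http://localhost:8888/web-llm/
```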
## How
