-You can also find a complete chat-app in [examples/simple-chat](examples/simple-chat/).
+You can also find a complete chatapp in [examples/simple-chat](examples/simple-chat/).

 ## Customized Model Weights

-WebLLM works a companion project of [MLC LLM](https://github.com/mlc-ai/mlc-llm).
-It reuses the model artifact and build flow of MLC LLM, please checkout MLC LLM document
-on how to build a new model weights and libraries (MLC LLM document will come in the incoming weeks).
+WebLLM works as a companion project of [MLC LLM](https://github.com/mlc-ai/mlc-llm).
+It reuses the model artifact and builds flow of MLC LLM, please check out MLC LLM document
+on how to build new model weights and libraries (MLC LLM document will come in the incoming weeks).
 To generate the wasm needed by WebLLM, you can run with `--target webgpu` in the mlc llm build.
-There are two elements of WebLLM package that enables new models and weight variants.
+There are two elements of the WebLLM package that enables new models and weight variants.

 - model_url: Contains a URL to model artifacts, such as weights and meta-data.
-- model_lib: The webassembly libary that contains the executables to accelerate the model computations.
+- model_lib: The web assembly libary that contains the executables to accelerate the model computations.

 Both are customizable in the WebLLM.

 ```typescript
 async main() {
 const myLlamaUrl = "/url/to/my/llama";
 const appConfig = {
-"model_list": [
-{
-"model_url": myLlamaUrl,
-"local_id": "MyLlama-3b-v1-q4f32_0"
-}
-],
-"model_lib_map": {
-"llama-v1-3b-q4f32_0": "/url/to/myllama3b.wasm",
-}
+"model_list": [
+{
+"model_url": myLlamaUrl,
+"local_id": "MyLlama-3b-v1-q4f32_0"
+}
+],
+"model_lib_map": {
+"llama-v1-3b-q4f32_0": "/url/to/myllama3b.wasm",
 };
 // override default
 const chatOpts = {
@@ -117,10 +116,10 @@ async main() {
 }
 ```
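
As a rough illustration of how the two fields fit together, the sketch below models the config shape from the snippet above and looks entries up by name. The `ModelRecord` and `AppConfig` interfaces and the `findModel` / `findModelLib` helpers are hypothetical names introduced here for illustration; they are not taken from the WebLLM API.

```typescript
// Hypothetical types mirroring the fields shown in the snippet above;
// they are NOT copied from the WebLLM type definitions.
interface ModelRecord {
  model_url: string;
  local_id: string;
}

interface AppConfig {
  model_list: ModelRecord[];
  model_lib_map: Record<string, string>;
}

// Find the model entry registered under a given local_id, if any.
function findModel(config: AppConfig, localId: string): ModelRecord | undefined {
  return config.model_list.find((m) => m.local_id === localId);
}

// Look up the wasm URL registered for a model lib name, if any.
// hasOwnProperty avoids matching inherited keys such as "toString".
function findModelLib(config: AppConfig, libName: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(config.model_lib_map, libName)
    ? config.model_lib_map[libName]
    : undefined;
}

const myLlamaUrl = "/url/to/my/llama";
const appConfig: AppConfig = {
  model_list: [{ model_url: myLlamaUrl, local_id: "MyLlama-3b-v1-q4f32_0" }],
  model_lib_map: { "llama-v1-3b-q4f32_0": "/url/to/myllama3b.wasm" },
};
```

Keeping the weights (`model_url`) and the compiled library (`model_lib_map`) in separate tables is what lets one wasm library serve several weight variants.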

-In many cases we only want to supply the model weight variant, but
+In many cases, we only want to supply the model weight variant, but
 not necessarily a new model. In such cases, we can reuse the model lib.
 In such cases, we can just pass in the `model_list` field and skip the model lib,
-and make sure the `mlc-chat-config.json` in the model url have a model lib
+and make sure the `mlc-chat-config.json` in the model url has a model lib
 that points to a prebuilt version, right now the prebuilt lib includes

 - `vicuna-v1-7b-q4f32_0`: llama-7b models.
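
The fallback described above can be sketched as a small lookup. This is a minimal sketch under stated assumptions: `classifyModelLib` and `PREBUILT_LIBS` are hypothetical names, the prebuilt set contains only the one lib name visible here, and giving a user-supplied `model_lib_map` entry priority over a prebuilt lib is an assumption rather than documented WebLLM behavior.

```typescript
// Assumption: only the prebuilt lib name visible in the text is listed here.
const PREBUILT_LIBS: ReadonlySet<string> = new Set(["vicuna-v1-7b-q4f32_0"]);

type LibSource = "custom" | "prebuilt" | "missing";

// Decide where a model lib would come from.
// Assumed priority: a user-supplied map entry first, then the prebuilt set;
// anything else is reported as missing.
function classifyModelLib(
  libName: string,
  customLibMap: Record<string, string> = {},
): LibSource {
  if (Object.prototype.hasOwnProperty.call(customLibMap, libName)) {
    return "custom";
  }
  if (PREBUILT_LIBS.has(libName)) {
    return "prebuilt";
  }
  return "missing";
}
```

A check like this could run before loading, so that a weight variant whose config names an unknown lib fails early with a clear message.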
@@ -131,16 +130,16 @@ that points to a prebuilt version, right now the prebuilt lib includes

 WebLLM package is a web runtime designed for [MLC LLM](https://github.com/mlc-ai/mlc-llm).

-1. Install all the prerequisite for web deployment:
-1.[emscripten](https://emscripten.org). It is an LLVM-based compiler which compiles C/C++ source code to WebAssembly.
+1. Install all the prerequisites for compilation:
+1. [emscripten](https://emscripten.org). It is an LLVM-based compiler that compiles C/C++ source code to WebAssembly.
 - Follow the [installation instruction](https://emscripten.org/docs/getting_started/downloads.html#installation-instructions-using-the-emsdk-recommended) to install the latest emsdk.
 - Source `emsdk_env.sh` by `source path/to/emsdk_env.sh`, so that `emcc` is reachable from PATH and the command `emcc` works.
 4. Install jekyll by following the [official guides](https://jekyllrb.com/docs/installation/). It is the package we use for website.
 5. Install jekyll-remote-theme by command. Try [gem mirror](https://gems.ruby-china.com/) if install blocked.
 ```shell
 gem install jekyll-remote-theme
 ```
-We can verify the success installation by trying out `emcc` and `jekyll`in terminal respectively.
+We can verify the successful installation by trying out `emcc` and `jekyll` in terminal, respectively.

 2. Setup necessary environment

@@ -155,40 +154,14 @@ WebLLM package is a web runtime designed for [MLC LLM](https://github.com/mlc-ai
 npm run build
 ```

-4. Validate some of the subpackages
+4. Validate some of the sub-packages

-You can then go to the subfolders in [examples] to validate some of the subpackages.
-We use Parcelv2 for bundling. Although parcel is not very good at tracking parent directory
-changes sometimes. When you made a change in the WebLLM package, try to edit the `package.json`
+You can then go to the subfolders in [examples] to validate some of the sub-packages.
+We use Parcelv2 for bundling. Although Parcel is not very good at tracking parent directory
+changes sometimes. When you make a change in the WebLLM package, try to edit the `package.json`
 of the subfolder and save it, which will trigger Parcel to rebuild.

-## How
-
-The key technology here is machine learning compilation (MLC). Our solution builds on the shoulders of the open source ecosystem, including Hugging Face, model variants from LLaMA and Vicuna, wasm and WebGPU. The main flow builds on Apache TVM Unity, an exciting ongoing development in the [Apache TVM Community](https://github.com/apache/tvm/).
-
-- We bake a language model's IRModule in TVM with native dynamic shape support, avoiding the need of padding to max length and reducing both computation amount and memory usage.
-- Each function in TVM’s IRModule can be further transformed and generate runnable code that can be deployed universally on any environment that is supported by minimum tvm runtime (JavaScript being one of them).
-- [TensorIR](https://arxiv.org/abs/2207.04296) is the key technique used to generate optimized programs. We provide productive solutions by quickly transforming TensorIR programs based on the combination of expert knowledge and automated scheduler.
-- Heuristics are used when optimizing light-weight operators in order to reduce the engineering pressure.
-- We utilize int4 quantization techniques to compress the model weights so that they can fit into memory.
-- We build static memory planning optimizations to reuse memory across multiple layers.
-- We use [Emscripten](https://emscripten.org/) and TypeScript to build a TVM web runtime that can deploy generated modules.
-- We also leveraged a wasm port of SentencePiece tokenizer.
-All parts of this workflow are done in Python, with the exception of course, of the last part that builds a 600 loc JavaScript app that connects things together. This is also a fun process of interactive development, bringing new models.
-
-All these are made possible by the open-source ecosystem that we leverage. Specifically, we make heavy use of [TVM unity](https://discuss.tvm.apache.org/t/establish-tvm-unity-connection-a-technical-strategy/13344), an exciting latest development in the TVM project that enables such Python-first interactive MLC development experiences that allows us to easily compose new optimizations, all in Python, and incrementally bring our app to the web.
-
-TVM unity also provides an easy way to compose new solutions in the ecosystem. We will continue to bring further optimizations such as fused quantization kernels, and bring them to more platforms.
-
-One key characteristic of LLM models is the dynamic nature of the model. As the decoding and prefill process depends on computations that grow with the size of tokens, we leverage the first-class dynamic shape support in TVM unity that represents sequence dimensions through symbolic integers. This allows us to plan ahead to statically allocate all the memory needed for the sequence window of interest without padding.
-
-We also leveraged the integration of tensor expressions to quickly express partial-tensor computations such as rotary embedding directly without materializing them into full-tensor matrix computations.
-
-
 ## Links

 - [Demo page](https://mlc.ai/web-llm/)
@@ -199,4 +172,4 @@ We also leveraged the integration of tensor expressions to quickly express parti

 This project is initiated by members from CMU catalyst, UW SAMPL, SJTU, OctoML and the MLC community. We would love to continue developing and supporting the open-source ML community.

-This project is only possible thanks to the shoulders open-source ecosystems that we stand on. We want to thank the Apache TVM community and developers of the TVM Unity effort. The open-source ML community members made these models publicly available. PyTorch and Hugging Face communities that make these models accessible. We would like to thank the teams behind vicuna, SentencePiece, LLaMA, Alpaca. We also would like to thank the WebAssembly, Emscripten, and WebGPU communities. Finally, thanks to Dawn and WebGPU developers.
+This project is only possible thanks to the shoulders open-source ecosystems that we stand on. We want to thank the Apache TVM community and developers of the TVM Unity effort. The open-source ML community members made these models publicly available. PyTorch and Hugging Face communities make these models accessible. We would like to thank the teams behind vicuna, SentencePiece, LLaMA, Alpaca. We also would like to thank the WebAssembly, Emscripten, and WebGPU communities. Finally, thanks to Dawn and WebGPU developers.