CLI
===============

MLC Chat CLI is the command line tool for running MLC-compiled LLMs out of the box interactively.

.. contents:: Table of Contents
   :local:
   :depth: 2

Install MLC-LLM Package
------------------------

The chat CLI is part of the MLC-LLM package.
To use the chat CLI, first install MLC LLM by following the instructions :ref:`here <install-mlc-packages>`.
Once you have installed the MLC-LLM package, run the following command to check that the installation was successful:

.. code:: bash

   mlc_llm chat --help

You should see the chat help message if the installation was successful.
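
For reference, the prebuilt nightly package supports Metal on macOS and Vulkan on Linux and Windows, and one way to install it is via pip inside a conda environment. The commands below are a sketch (the environment name is a placeholder); the installation page linked above remains the authoritative reference.

.. code:: bash

   # Activate the conda environment you want to install into (placeholder name).
   conda activate your-environment
   # Install the nightly MLC-LLM and MLC-AI wheels from the MLC wheel index.
   python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
   # Verify that the chat CLI is available.
   mlc_llm chat --help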

Quick Start
------------

This section provides a quick start guide for working with the MLC-LLM chat CLI.
To launch a CLI session, run the following command:

.. code:: bash

   mlc_llm chat MODEL [--model-lib PATH-TO-MODEL-LIB]

where ``MODEL`` is the model folder after compiling with the :ref:`MLC-LLM build process <compile-model-libraries>`. Information about other arguments can be found in the next section.
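
For example, to chat with a model prebuilt by the MLC team, you can pass its Hugging Face repository path with the ``HF://`` prefix; the model weights and library are then downloaded automatically. The model below is just one example of such a prebuilt:

.. code:: bash

   mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC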

Once the chat CLI is ready, you can enter a prompt to interact with the model.

.. code::

   You can use the following special commands:
     /help               print the special commands
     /exit               quit the cli
     /stats              print out stats of last request (token/sec)
     /metrics            print out full engine metrics
     /reset              restart a fresh chat
     /set [overrides]    override settings in the generation config. For example,
                         `/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
                         Note: Separate stop words in the `stop` option with commas (,).
     Multi-line input: Use escape+enter to start a new line.

   >>> What's the meaning of life?
   The meaning of life is a philosophical and metaphysical question related to the purpose or significance of life or existence in general...

.. note::

   If you want to enable tensor parallelism to run LLMs on multiple GPUs,
   specify the argument ``--overrides "tensor_parallel_shards=$NGPU"``.
   For example:

   .. code:: shell

      mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --overrides "tensor_parallel_shards=2"


The ``mlc_llm chat`` Command
----------------------------
| 67 | +---------------------------- |
85 | 68 |
|
86 |
| -For models other than the prebuilt ones we provided: |
| 69 | +We provide the list of chat CLI interface for reference. |
87 | 70 |
|
88 |
| -1. If the model is a variant to an existing model library (e.g. ``WizardMathV1.1`` and ``OpenHermes`` are variants of ``Mistral``), |
89 |
| - follow :ref:`convert-weights-via-MLC` to convert the weights and reuse existing model libraries. |
90 |
| -2. Otherwise, follow :ref:`compile-model-libraries` to compile both the model library and weights. |
| 71 | +.. code:: bash |
91 | 72 |
|
92 |
| -Once you have the model locally compiled with a model library and model weights, to run ``mlc_llm``, simply |
| 73 | + mlc_llm serve MODEL [--model-lib PATH-TO-MODEL-LIB] [--device DEVICE] [--overrides OVERRIDES] |
93 | 74 |
|
94 |
| -- Specify the path to ``mlc-chat-config.json`` and the converted model weights to ``--model`` |
95 |
| -- Specify the path to the compiled model library (e.g. a .so file) to ``--model-lib`` |
96 | 75 |
|
97 |
| -.. code:: shell |
| 76 | +MODEL The model folder after compiling with MLC-LLM build process. The parameter |
| 77 | + can either be the model name with its quantization scheme |
| 78 | + (e.g. ``Llama-2-7b-chat-hf-q4f16_1``), or a full path to the model |
| 79 | + folder. In the former case, we will use the provided name to search |
| 80 | + for the model folder over possible paths. |
98 | 81 |
|
99 |
| - mlc_llm chat dist/Llama-2-7b-chat-hf-q4f16_1-MLC \ |
100 |
| - --device "cuda:0" --overrides context_window_size=1024 \ |
101 |
| - --model-lib dist/prebuilt_libs/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-q4f16_1-vulkan.so |
102 |
| - # CUDA on Linux: dist/prebuilt_libs/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-q4f16_1-cuda.so |
103 |
| - # Metal on macOS: dist/prebuilt_libs/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-q4f16_1-metal.so |
104 |
| - # Same rule applies for other platforms |
| 82 | +--model-lib A field to specify the full path to the model library file to use (e.g. a ``.so`` file). |
| 83 | +--device The description of the device to run on. User should provide a string in the |
| 84 | + form of ``device_name:device_id`` or ``device_name``, where ``device_name`` is one of |
| 85 | + ``cuda``, ``metal``, ``vulkan``, ``rocm``, ``opencl``, ``auto`` (automatically detect the |
| 86 | + local device), and ``device_id`` is the device id to run on. The default value is ``auto``, |
| 87 | + with the device id set to 0 for default. |
| 88 | +--overrides Model configuration override. Supports overriding |
| 89 | + ``context_window_size``, ``prefill_chunk_size``, ``sliding_window_size``, ``attention_sink_size``, |
| 90 | + ``max_batch_size`` and ``tensor_parallel_shards``. The overrides could be explicitly |
| 91 | + specified via details knobs, e.g. --overrides ``context_window_size=1024;prefill_chunk_size=128``. |
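
As an illustration, here is one way a locally compiled model might be launched with an explicit device and configuration overrides. The folder and library paths below are placeholders following the ``dist/`` layout produced when compiling models yourself; substitute the artifacts from your own build:

.. code:: bash

   mlc_llm chat dist/Llama-2-7b-chat-hf-q4f16_1-MLC \
       --device "cuda:0" \
       --overrides "context_window_size=1024;prefill_chunk_size=128" \
       --model-lib dist/prebuilt_libs/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-q4f16_1-cuda.so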