# Text Generation

We provide inference benchmarking scripts for text generation with large language models.<br />
Supported large language model families include LLaMA 2, GPT-J, OPT, and Bloom.<br />
The scripts include both single-instance and distributed (DeepSpeed) use cases.<br />
The scripts cover model generation inference with low-precision cases for different models, with the best performance and accuracy (fp16 AMP and weight-only quantization).<br />

# Supported Model List

| MODEL FAMILY | Verified < MODEL ID > (Hugging Face hub) | FP16 | Weight-only quantization INT4 |
|---|:---:|:---:|:---:|
|LLAMA 2| "meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-13b-hf", "meta-llama/Llama-2-70b-hf" | ✅ | ❎ |
|GPT-J| "EleutherAI/gpt-j-6b" | ✅ | ✅ |
|OPT|"facebook/opt-6.7b", "facebook/opt-30b"| ✅ | ❎ |
|Bloom|"bigscience/bloom-7b1", "bigscience/bloom"| ✅ | ❎ |

*Note*: The verified models above (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from the LLAMA family) are well supported with all optimizations, such as indirect-access KV cache and fused ROPE. For other LLM model families, work is in progress to cover those optimizations, which will expand the model list above.

# Supported Platforms

\* PVC (1550/1100): supports all the models in the model list<br />
\* ATS-M, Arc: support GPT-J (EleutherAI/gpt-j-6b)

# Environment Setup

1. Get the Intel® Extension for PyTorch\* source code

```bash
git clone https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git checkout v2.1.10+xpu
git submodule sync
git submodule update --init --recursive
```

2.a. It is highly recommended to build a Docker container from the provided `Dockerfile` for single-instance executions.

```bash
# Build an image with the provided Dockerfile by compiling Intel® Extension for PyTorch* from source
DOCKER_BUILDKIT=1 docker build -f examples/gpu/inference/python/llm/Dockerfile --build-arg GID_RENDER=$(getent group render | sed -E 's,^render:[^:]*:([^:]*):.*$,\1,') --build-arg COMPILE=ON -t ipex-llm:2.1.10 .

# Build an image with the provided Dockerfile by installing Intel® Extension for PyTorch* from prebuilt wheel files
DOCKER_BUILDKIT=1 docker build -f examples/gpu/inference/python/llm/Dockerfile --build-arg GID_RENDER=$(getent group render | sed -E 's,^render:[^:]*:([^:]*):.*$,\1,') -t ipex-llm:2.1.10 .

# Run the container with the command below
docker run --rm -it --privileged --device=/dev/dri --ipc=host ipex-llm:2.1.10 bash

# Once the command prompt shows that you are inside the docker container, enter the llm examples directory
cd llm
```
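
If you want to reuse models already downloaded on the host instead of re-downloading them inside the container, you can mount a Hugging Face cache directory when starting the container. This is a minimal sketch, not part of the provided `Dockerfile`: the host path and the in-container path `/hf_cache` are placeholders to adjust for your setup, and `HF_HOME` is the standard Hugging Face environment variable for relocating the cache.

```bash
# Mount a host Hugging Face cache into the container (paths are examples; adjust to your setup).
# HF_HOME tells transformers/huggingface_hub where to look for and store downloaded models.
docker run --rm -it --privileged --device=/dev/dri --ipc=host \
    -v /path/on/host/hf_cache:/hf_cache \
    -e HF_HOME=/hf_cache \
    ipex-llm:2.1.10 bash
```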

2.b. Alternatively, you can use the provided environment configuration script to set up an environment without a docker container.

```bash
# GCC 12.3 is required. Its installation can be taken care of by the environment configuration script.
# Create a conda environment
conda create -n llm python=3.9 -y
conda activate llm

# Set up the environment with the provided script
cd examples/gpu/inference/python/llm
# If you want to install Intel® Extension for PyTorch\* from prebuilt wheel files, use the command below:
bash ./tools/env_setup.sh 7
# If you want to install Intel® Extension for PyTorch\* from source, use the commands below:
bash ./tools/env_setup.sh 3 <DPCPP_ROOT> <ONEMKL_ROOT> <AOT>
export LD_PRELOAD=$(bash ../../../../../tools/get_libstdcpp_lib.sh)
```

\* `DPCPP_ROOT` is the placeholder for the path where the DPC++ compiler was installed. By default, it is `/opt/intel/oneapi/compiler/latest`.<br />
\* `ONEMKL_ROOT` is the placeholder for the path where oneMKL was installed. By default, it is `/opt/intel/oneapi/mkl/latest`.<br />
\* `AOT` is a text string to enable `Ahead-Of-Time` compilation for specific GPU models. Check the [tutorial](../../../../../docs/tutorials/technical_details/AOT.md) for details.<br />
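
For reference, a source build with a default oneAPI installation could look like the sketch below. The paths match the defaults listed above; the AOT value `pvc` is only an illustrative target (assumed here for Intel® Data Center GPU Max Series) and should be chosen for your own GPU per the AOT tutorial linked above.

```bash
# Example invocation of the setup script for a source build (adjust paths and AOT target as needed)
bash ./tools/env_setup.sh 3 /opt/intel/oneapi/compiler/latest /opt/intel/oneapi/mkl/latest pvc
export LD_PRELOAD=$(bash ../../../../../tools/get_libstdcpp_lib.sh)
```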

3. Once an environment is configured with either method above, set the necessary environment variables with the environment activation script.

```bash
# If you use docker images built from the provided Dockerfile, you do NOT need to run the following 2 commands.
source <DPCPP_ROOT>/env/vars.sh
source <ONEMKL_ROOT>/env/vars.sh

# Activate environment variables
source ./tools/env_activate.sh
```

# Run Model Generations

| Benchmark mode | FP16 | Weight-only quantization INT4 |
|---|:---:|:---:|
|Single instance | ✅ | ✅ |
| Distributed (autotp) | ✅ | ❎ |

## Example usages of the one-click script

You can run LLM inference for all benchmark cases with the one-click bash script "run_benchmark.sh".

```bash
bash run_benchmark.sh
```

### Single Instance Performance

```bash
# fp16 benchmark
python -u run_generation.py --benchmark -m ${model} --sub-model-name ${sub_model_name} --num-beams ${beam} --num-iter ${iter} --batch-size ${bs} --input-tokens ${input} --max-new-tokens ${out} --device xpu --ipex --dtype float16 --token-latency
```

Notes:

(1) By default, generations use bs = 1, input token size = 1024, output token size = 128, iteration num = 10, and "beam search" with beam size = 4. For beam size = 1 and other settings, please export env settings such as "beam=1", "input=32", "output=32", "iter=5", as in the example below.
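
For example, a short-sequence greedy-search run could be configured as follows. This is a minimal sketch based on the note above; it assumes the exported variables are picked up by the one-click benchmark script, and the values are only illustrative.

```bash
# Override the default generation settings (illustrative values)
export beam=1        # beam size 1 (greedy search)
export input=32      # input token size
export output=32     # output token size
export iter=5        # number of benchmark iterations

bash run_benchmark.sh
```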

### Distributed Performance with DeepSpeed

You can run distributed LLM inference for all benchmark cases with the one-click bash script "run_benchmark_ds.sh".

```bash
bash run_benchmark_ds.sh
```

```bash
# distributed env setting
source ${ONECCL_DIR}/build/_install/env/setvars.sh
# fp16 benchmark
mpirun -np 2 --prepend-rank python -u run_generation_with_deepspeed.py --benchmark -m ${model} --sub-model-name ${sub_model_name} --num-beams ${beam} --num-iter ${iter} --batch-size ${bs} --input-tokens ${input} --max-new-tokens ${out} --device xpu --ipex --dtype float16 --token-latency
```

Notes:

(1) By default, generations use bs = 1, input token size = 1024, output token size = 128, iteration num = 10, and "beam search" with beam size = 4. For beam size = 1 and other settings, please export env settings such as "beam=1", "input=32", "output=32", "iter=5", as in the example below.
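
As with the single-instance case, a minimal sketch of overriding the defaults before a distributed run is shown below; it assumes the exported variables are picked up by the one-click distributed script, and the values are only illustrative.

```bash
# Override the default generation settings for the distributed benchmark (illustrative values)
export beam=1
export input=32
export output=32
export iter=5

bash run_benchmark_ds.sh
```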

# Advanced Usage

## Weight-only quantization with low-precision checkpoint (Experimental)

Using INT4 weights can further improve performance by reducing memory bandwidth. However, direct per-channel quantization of weights to INT4 typically results in poor accuracy. Some algorithms modify weights through calibration before quantizing them, to minimize the accuracy drop; GPTQ is one such algorithm. You can generate modified weights and quantization info (scales, zero points) for a given model with a dataset for specified tasks using such algorithms. The results are saved as a `state_dict` in a `.pt` file. We provide a script here to run GPT-J.

### Single Instance GPT-J Weight-only Quantization Performance

```bash
# quantization benchmark
# To run the quantization performance benchmark, first get the quantized weight in step (1), then run the benchmark in step (2).

## (1) Get the quantized weight
# Download link: https://intel-extension-for-pytorch.s3.amazonaws.com/miscellaneous/llm/xpu/gptj_int4_weight_master.pt
export weight_path=<path-to-your-weight>

## (2) Run the quantization performance test
python -u run_generation.py --device xpu --ipex --dtype float16 --input-tokens ${input} --max-new-tokens ${out} --token-latency --benchmark --num-beams ${beam} -m ${model} --sub-model-name ${sub_model_name} --woq --woq_checkpoint_path ${weight_path}
```

### Single Instance GPT-J Weight-only Quantization INT4 Accuracy

```bash
# We use the "lambada_standard" task to check accuracy
LLM_ACC_TEST=1 python -u run_generation.py -m ${model} --sub-model-name ${sub_model_name} --ipex --dtype float16 --accuracy-only --acc-tasks ${task} --woq --woq_checkpoint_path ${weight_path}
```

## Single Instance Accuracy

Set the accuracy test task `{TASK_NAME}` from the choices in this [link](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md); by default we use "lambada_standard".

```bash
# one-click bash script
bash run_accuracy.sh

# float16
LLM_ACC_TEST=1 python -u run_generation.py -m ${model} --sub-model-name ${sub_model_name} --ipex --dtype float16 --accuracy-only --acc-tasks ${task}
```
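
For example, a GPT-J accuracy check could look like the sketch below. The variable values are only illustrative assumptions (in particular, the `--sub-model-name` value is a guess; check the script's accepted names for your setup).

```bash
# Illustrative accuracy run for GPT-J on lambada_standard (values are assumptions; adjust as needed)
model="EleutherAI/gpt-j-6b"
sub_model_name="gpt-j-6b"
task="lambada_standard"
LLM_ACC_TEST=1 python -u run_generation.py -m ${model} --sub-model-name ${sub_model_name} --ipex --dtype float16 --accuracy-only --acc-tasks ${task}
```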

## Distributed Accuracy with DeepSpeed

```bash
# Run distributed accuracy with 2 ranks on one node for float16 with ipex
source ${ONECCL_DIR}/build/_install/env/setvars.sh

# one-click bash script
bash run_accuracy_ds.sh

# float16
LLM_ACC_TEST=1 mpirun -np 2 --prepend-rank python -u run_generation_with_deepspeed.py -m ${model} --sub-model-name ${sub_model_name} --ipex --dtype float16 --accuracy-only --acc-tasks ${task} 2>&1
```