
LLaMa Inference For TencentPretrain

This project provides LLaMa inference and microservice deployment based on TencentPretrain.


Features

  • Int8 Inference Supports int8 inference via the bitsandbytes library, and adds batch inference, which the LM inference script in TencentPretrain lacks.
  • Optimized Inference Caches keys and values in multi-head attention, so each decoding step only needs to process the newly generated token.
  • Multi-GPU Inference Supports tensor-parallel inference across multiple GPUs.
  • Microservices Supports a simple Flask microservice and a Gradio-based online demo.
  • LoRA Model Inference Planned.

Note: CUDA is required.
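The key/value caching mentioned above can be sketched as follows. This is a minimal single-head NumPy illustration of the idea, not the actual TencentPretrain implementation: past keys and values stay in a cache, so each step only adds the projections of the newest token.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (1, d); K, V: (t, d) -> attention output of shape (1, d)
    scores = q @ K.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

outputs = []
for step in range(4):
    # Each step computes q/k/v only for the newest token;
    # all earlier keys/values are reused from the cache.
    q = rng.normal(size=(1, d))
    k = rng.normal(size=(1, d))
    v = rng.normal(size=(1, d))
    K_cache = np.concatenate([K_cache, k])
    V_cache = np.concatenate([V_cache, v])
    outputs.append(attend(q, K_cache, V_cache))
```

Without the cache, every step would recompute keys and values for the whole prefix, which is what makes naive autoregressive decoding quadratic in sequence length.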


Requirements

  • Python >= 3.7
  • torch >= 1.9
  • bitsandbytes
  • argparse

Input Parameters

  • --load_model_path (Required) Path to the pretrained model; fp16 by default.
  • --test_path (Required) Input prompts, one prompt per line.
  • --prediction_path (Required) Path where results are saved.
  • --config_path (Required) Model hyper-parameter file (JSON config).
  • --spm_model_path (Required) Path to the tokenizer model.
  • --batch_size (Optional) Defaults to 1. Suggestion: keep it consistent with the input.
  • --seq_length (Optional) Defaults to 128. Total length of the generated content, i.e. the length of the input plus the generated text.
  • --world_size (Optional) Defaults to 1. Number of GPUs for tensor-parallel inference.
  • --use_int8 (Optional) Defaults to False. Whether to use int8 for inference.
  • --top_k (Optional) Defaults to 40.
  • --top_p (Optional) Defaults to 0.95.
  • --temperature (Optional) Defaults to 0.8.
  • --repetition_penalty_range (Optional) Defaults to 1024.
  • --repetition_penalty_slope (Optional) Defaults to 0.
  • --repetition_penalty (Optional) Defaults to 1.15.
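The sampling parameters top_k, top_p, and temperature work roughly as in the standard top-k / nucleus filtering scheme sketched below. This is a hedged NumPy illustration of the common technique, and may differ in detail from the script's actual implementation:

```python
import numpy as np

def filter_logits(logits, top_k=40, top_p=0.95, temperature=0.8):
    """Mask tokens outside top-k and outside the top-p nucleus to -inf."""
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens.
    if 0 < top_k <= len(logits):
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    # Top-p (nucleus): keep the smallest set of tokens whose
    # cumulative probability mass reaches top_p.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = cum.searchsorted(top_p) + 1  # number of tokens to keep
    keep = order[:cutoff]
    masked = np.full_like(logits, -np.inf)
    masked[keep] = logits[keep]
    return masked

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
filtered = filter_logits(logits, top_k=3, top_p=0.9)
```

Sampling then draws from the softmax over the surviving logits; lower temperature sharpens the distribution, while smaller top_k/top_p restrict it to fewer candidates.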

Quick Start

FP16/Int8 Inference

fp16 inference:

python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt  \
                      --load_model_path xxx.bin \
                      --config_path ./config/llama_7b_config.json \
                      --spm_model_path ./tokenizer.model

int8 inference:

python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt  \
                      --load_model_path xxx.bin --use_int8 \
                      --config_path ./config/llama_7b_config.json \
                      --spm_model_path ./tokenizer.model

Multi-round chat

Optional parameter: keep_length_ratio, the ratio of the conversation context to keep. Entering 'clear' starts a new chat session, and 'exit' quits the chat.

python llama_dialogue.py --load_model_path xxxx.bin \
                         --config_path config.json \
                         --spm_model_path tokenizer.model \
                         --world_size 2
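One plausible way keep_length_ratio could prune the running dialogue is sketched below. Both truncate_history and its token-level behavior are assumptions for illustration, not the script's actual code:

```python
def truncate_history(history_tokens, seq_length=128, keep_length_ratio=0.5):
    """Keep at most keep_length_ratio * seq_length of the most recent
    context tokens, leaving room in seq_length for the next reply."""
    budget = int(seq_length * keep_length_ratio)
    return history_tokens[-budget:]

history = list(range(200))        # token ids of the running conversation
kept = truncate_history(history)  # retains only the most recent tokens
```

Keeping only the tail of the context bounds the prompt length across many chat rounds, at the cost of the model forgetting the oldest turns.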

Gradio server

Gradio needs to be installed:

pip install gradio
python llama_gradio.py --load_model_path xxxx.bin \
                       --config_path config.json \
                       --spm_model_path tokenizer.model

Then open http://127.0.0.1:7860/ in a browser.


Microservices deployment

Flask needs to be installed:

pip install flask 
python llama_server.py --load_model_path xxxx.bin \
                       --config_path config.json \
                       --spm_model_path tokenizer.model

Example curl request:

curl -H 'Content-Type: application/json' http://127.0.0.1:8888/chat -d '{"question": "xxx"}' 
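A minimal server matching this endpoint could look like the sketch below. The /chat route and port 8888 come from the curl example above; the generate() helper is a placeholder standing in for the actual model call in llama_server.py:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(question):
    # Placeholder for the real model inference call.
    return "answer to: " + question

@app.route("/chat", methods=["POST"])
def chat():
    # Expects a JSON body like {"question": "xxx"}.
    data = request.get_json(force=True)
    answer = generate(data.get("question", ""))
    return jsonify({"answer": answer})

# To serve requests, uncomment:
# app.run(host="127.0.0.1", port=8888)
```

The real script presumably loads the model once at startup and calls it inside the route handler, so each HTTP request reuses the loaded weights.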

Multi-GPU Inference

The tensor_parallel library needs to be installed. Set world_size to the number of GPUs (GPU IDs start from 0):

pip install tensor_parallel
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxxx.bin \
                      --config_path config.json \
                      --spm_model_path tokenizer.model \
                      --world_size 2