This project provides LLaMA inference and microservice deployment based on TencentPretrain.
- **Int8 Inference** Supports int8 inference with the bitsandbytes library, and adds batch inference compared to the LM inference script in TencentPretrain.
- **Optimized Inference** Caches the keys and values in multi-head attention, so only the newly generated token has to be fed in at each inference step.
- **LLM Multi-GPU Inference** Supports tensor-parallel multi-GPU inference.
- **Microservices** Provides a simple Flask microservice and a Gradio-based online demo.
- **LoRA Model Inference** To be continued.
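The KV-cache optimization above can be sketched as follows. This is a minimal single-head illustration in NumPy, not the project's actual implementation: at each generation step only the new token's key and value are computed and appended, while earlier entries are reused.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (d,), K/V: (t, d) -> attention-weighted sum over all cached positions
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))

outputs = []
for step in range(4):
    # Without a cache, keys/values for every earlier token would be
    # recomputed each step; with the cache we only append the new ones.
    k_new, v_new, q_new = rng.normal(size=(3, d))
    K_cache = np.vstack([K_cache, k_new[None]])
    V_cache = np.vstack([V_cache, v_new[None]])
    outputs.append(attend(q_new, K_cache, V_cache))
```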
Tips: CUDA is required.
- Python >= 3.7
- torch >= 1.9
- bitsandbytes
- argparse
- --load_model_path (Required) Path to the pretrained model, fp16 by default.
- --test_path (Required) Input prompts, one prompt per line.
- --prediction_path (Required) Path where the results are saved.
- --config_path (Required) Configuration file of model hyper-parameters.
- --spm_model_path (Required) Path to the tokenizer model.
- --batch_size (Optional) Default 1. Suggestion: keep it consistent with the input.
- --seq_length (Optional) Default 128. Total length of the generated content, i.e. the length of the input plus the generated sentence.
- --world_size (Optional) Default 1. Number of GPUs for tensor-parallel inference.
- --use_int8 (Optional) Default False. Whether to use int8 for inference.
- --top_k (Optional) Default 40.
- --top_p (Optional) Default 0.95.
- --temperature (Optional) Default 0.8.
- --repetition_penalty_range (Optional) Default 1024.
- --repetition_penalty_slope (Optional) Default 0.
- --repetition_penalty (Optional) Default 1.15.
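The sampling parameters combine roughly as in this sketch (a hypothetical helper, not the project's code): temperature rescales the logits, top-k keeps only the k largest, and top-p (nucleus sampling) keeps the smallest set of tokens whose probability mass reaches p.

```python
import numpy as np

def sample_logits(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # top-k: mask everything outside the k largest logits
    if 0 < top_k <= len(logits):
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # top-p (nucleus): keep the smallest set of tokens whose mass >= top_p
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    keep = order[:cutoff]
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    p /= p.sum()
    return rng.choice(len(p), p=p)
```

With a sharply peaked distribution, both a small top_k and a small top_p collapse sampling onto the most likely token.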
fp16 inference:
```bash
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxx.bin \
                      --config_path ./config/llama_7b_config.json \
                      --spm_model_path ./tokenizer.model
```
int8 inference:
```bash
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxx.bin --use_int8 \
                      --config_path ./config/llama_7b_config.json \
                      --spm_model_path ./tokenizer.model
```
Optional parameter: keep_length_ratio, the ratio of the dialogue context to keep between turns. During the chat, entering 'clear' starts a new round of conversation and 'exit' quits the chat.
```bash
python llama_dialogue.py --load_model_path xxxx.bin \
                         --config_path config.json \
                         --spm_model_path tokenizer.model \
                         --world_size 2
```
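A hypothetical sketch of how the dialogue loop might handle keep_length_ratio and the 'clear'/'exit' commands; the function names and truncation rule here are illustrative, not the script's actual internals.

```python
def truncate_history(history, seq_length, keep_length_ratio=0.5):
    """Keep at most keep_length_ratio * seq_length characters of context,
    dropping the oldest turns first."""
    budget = int(seq_length * keep_length_ratio)
    kept, used = [], 0
    for turn in reversed(history):
        if used + len(turn) > budget:
            break
        kept.append(turn)
        used += len(turn)
    return list(reversed(kept))

def dialogue_loop(read_input, generate, seq_length=128, keep_length_ratio=0.5):
    history = []
    while True:
        text = read_input()
        if text == "exit":       # quit the chat
            break
        if text == "clear":      # start a fresh round of conversation
            history = []
            continue
        history = truncate_history(history + [text], seq_length, keep_length_ratio)
        reply = generate(history)
        history.append(reply)
```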
The Gradio demo requires gradio:
```bash
pip install gradio
```
```bash
python llama_gradio.py --load_model_path xxxx.bin \
                       --config_path config.json \
                       --spm_model_path tokenizer.model
```
Then open http://127.0.0.1:7860/ in a browser.
The Flask service requires flask:
```bash
pip install flask
```
```bash
python llama_server.py --load_model_path xxxx.bin \
                       --config_path config.json \
                       --spm_model_path tokenizer.model
```
Example curl request:
```bash
curl -H 'Content-Type: application/json' http://127.0.0.1:8888/chat -d '{"question": "xxx"}'
```
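For orientation, a minimal sketch of what a /chat endpoint like the one in llama_server.py could look like; the generate function here is a placeholder, not the project's actual model call.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(question):
    # Placeholder standing in for the actual LLaMA generation call.
    return "echo: " + question

@app.route("/chat", methods=["POST"])
def chat():
    # Accept {"question": "..."} and return the model's answer as JSON.
    data = request.get_json(force=True)
    answer = generate(data.get("question", ""))
    return jsonify({"answer": answer})

# To serve: app.run(host="127.0.0.1", port=8888)
```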
Tensor-parallel inference requires the tensor_parallel package. Set --world_size to the number of GPUs (GPU IDs start from 0):
```bash
pip install tensor_parallel
```
```bash
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxxx.bin \
                      --config_path config.json \
                      --spm_model_path tokenizer.model \
                      --world_size 2
```
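Tensor parallelism splits each weight matrix across GPUs so every device computes only its slice. Conceptually (shown here on CPU with NumPy, two "shards" standing in for two GPUs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))    # batch of activations
W = rng.normal(size=(16, 32))   # full linear-layer weight

# Column parallelism: each device holds half of the output columns,
# computes its partial result, and the halves are concatenated.
W0, W1 = np.hsplit(W, 2)
y_parallel = np.concatenate([x @ W0, x @ W1], axis=1)

# The sharded computation matches the unsharded one.
y_full = x @ W
assert np.allclose(y_parallel, y_full)
```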