This project provides LLaMA inference and microservice deployment based on TencentPretrain.
- **Int8 Inference** Supports int8 inference with the bitsandbytes library, and adds batch inference compared to the LM inference script in TencentPretrain.
- **Optimized Inference** Caches the keys and values in multi-head attention, so only the newly generated token has to be fed in at each inference step.
- **LLM Multi-GPU Inference** Supports tensor-parallel multi-GPU inference.
- **Microservices** Provides a simple Flask microservice and a Gradio-based online demo.
- **LoRA Model Inference** To be continued.
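The KV-cache optimization above can be sketched as follows. This is a minimal single-head illustration in NumPy, not the project's actual implementation: at each generation step only the new token's key and value are computed and appended, while earlier entries are reused.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (d,), K/V: (t, d) -> attention-weighted sum over all cached positions
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))

outputs = []
for step in range(4):
    # Without a cache, keys/values for every earlier token would be
    # recomputed each step; with the cache we only append the new ones.
    k_new, v_new, q_new = rng.normal(size=(3, d))
    K_cache = np.vstack([K_cache, k_new[None]])
    V_cache = np.vstack([V_cache, v_new[None]])
    outputs.append(attend(q_new, K_cache, V_cache))
```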
Tips: CUDA is required.
- Python >= 3.7
- torch >= 1.9
- bitsandbytes
- argparse
- --load_model_path (Required) Path to the pretrained model, fp16 by default.
- --test_path (Required) Input prompts, one prompt per line.
- --prediction_path (Required) Path where the results are saved.
- --config_path (Required) Configuration file of model hyper-parameters.
- --spm_model_path (Required) Path to the tokenizer model.
- --batch_size (Optional) Default 1. Suggestion: keep it consistent with the input.
- --seq_length (Optional) Default 128. Total length of the generated content, i.e. the length of the input plus the generated sentence.
- --world_size (Optional) Default 1. Number of GPUs for tensor-parallel inference.
- --use_int8 (Optional) Default False. Whether to use int8 for inference.
- --top_k (Optional) Default 40.
- --top_p (Optional) Default 0.95.
- --temperature (Optional) Default 0.8.
- --repetition_penalty_range (Optional) Default 1024.
- --repetition_penalty_slope (Optional) Default 0.
- --repetition_penalty (Optional) Default 1.15.
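The sampling parameters combine roughly as in this sketch (a hypothetical helper, not the project's code): temperature rescales the logits, top-k keeps only the k largest, and top-p (nucleus sampling) keeps the smallest set of tokens whose probability mass reaches p.

```python
import numpy as np

def sample_logits(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # top-k: mask everything outside the k largest logits
    if 0 < top_k <= len(logits):
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # top-p (nucleus): keep the smallest set of tokens whose mass >= top_p
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    keep = order[:cutoff]
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    p /= p.sum()
    return rng.choice(len(p), p=p)
```

With a sharply peaked distribution, both a small top_k and a small top_p collapse sampling onto the most likely token.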
fp16 inference:
```bash
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxx.bin \
                      --config_path ./config/llama_7b_config.json \
                      --spm_model_path ./tokenizer.model
```
int8 inference:
```bash
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxx.bin --use_int8 \
                      --config_path ./config/llama_7b_config.json \
                      --spm_model_path ./tokenizer.model
```
Optional parameter: keep_length_ratio, the ratio of the dialogue context to keep between turns. During the chat, entering 'clear' starts a new round of conversation and 'exit' quits the chat.
```bash
python llama_dialogue.py --load_model_path xxxx.bin \
                         --config_path config.json \
                         --spm_model_path tokenizer.model \
                         --world_size 2
```
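A hypothetical sketch of how the dialogue loop might handle keep_length_ratio and the 'clear'/'exit' commands; the function names and truncation rule here are illustrative, not the script's actual internals.

```python
def truncate_history(history, seq_length, keep_length_ratio=0.5):
    """Keep at most keep_length_ratio * seq_length characters of context,
    dropping the oldest turns first."""
    budget = int(seq_length * keep_length_ratio)
    kept, used = [], 0
    for turn in reversed(history):
        if used + len(turn) > budget:
            break
        kept.append(turn)
        used += len(turn)
    return list(reversed(kept))

def dialogue_loop(read_input, generate, seq_length=128, keep_length_ratio=0.5):
    history = []
    while True:
        text = read_input()
        if text == "exit":       # quit the chat
            break
        if text == "clear":      # start a fresh round of conversation
            history = []
            continue
        history = truncate_history(history + [text], seq_length, keep_length_ratio)
        reply = generate(history)
        history.append(reply)
```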
The Gradio demo requires gradio:
```bash
pip install gradio
```
```bash
python llama_gradio.py --load_model_path xxxx.bin \
                       --config_path config.json \
                       --spm_model_path tokenizer.model
```
Then open http://127.0.0.1:7860/ in a browser.
The Flask service requires flask:
```bash
pip install flask
```
```bash
python llama_server.py --load_model_path xxxx.bin \
                       --config_path config.json \
                       --spm_model_path tokenizer.model
```
Example curl request:
```bash
curl -H 'Content-Type: application/json' http://127.0.0.1:8888/chat -d '{"question": "xxx"}'
```
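For orientation, a minimal sketch of what a /chat endpoint like the one in llama_server.py could look like; the generate function here is a placeholder, not the project's actual model call.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(question):
    # Placeholder standing in for the actual LLaMA generation call.
    return "echo: " + question

@app.route("/chat", methods=["POST"])
def chat():
    # Accept {"question": "..."} and return the model's answer as JSON.
    data = request.get_json(force=True)
    answer = generate(data.get("question", ""))
    return jsonify({"answer": answer})

# To serve: app.run(host="127.0.0.1", port=8888)
```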
Tensor-parallel inference requires the tensor_parallel package. Set --world_size to the number of GPUs (GPU IDs start from 0):
```bash
pip install tensor_parallel
```
```bash
python llama_infer.py --test_path ./prompts.txt --prediction_path ./result.txt \
                      --load_model_path xxxx.bin \
                      --config_path config.json \
                      --spm_model_path tokenizer.model \
                      --world_size 2
```
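Tensor parallelism splits each weight matrix across GPUs so every device computes only its slice. Conceptually (shown here on CPU with NumPy, two "shards" standing in for two GPUs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))    # batch of activations
W = rng.normal(size=(16, 32))   # full linear-layer weight

# Column parallelism: each device holds half of the output columns,
# computes its partial result, and the halves are concatenated.
W0, W1 = np.hsplit(W, 2)
y_parallel = np.concatenate([x @ W0, x @ W1], axis=1)

# The sharded computation matches the unsharded one.
y_full = x @ W
assert np.allclose(y_parallel, y_full)
```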