FasterTransformer Backend

This is the Triton backend for FasterTransformer. The repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA. As of FasterTransformer v4.0, it supports multi-GPU inference on the GPT-3 model. This backend integrates FasterTransformer into Triton so that giant GPT-3 models can be served by Triton. The example below shows how to use the FasterTransformer backend in Triton to run inference on a GPT-3 model with 345M parameters trained with Megatron-LM.

Note that this is a research and prototyping tool, not a formal product or maintained framework. Users can learn more about Triton backends in the backend repo. Ask questions or report problems on the issues page of this fastertransformer_backend repo.

Table Of Contents

  • Setup
  • Run Serving

Setup

  • Prepare Machine

We provide a Dockerfile, based on the Triton image nvcr.io/nvidia/tritonserver:21.02-py3, to set up the environment.

mkdir workspace && cd workspace 
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
nvidia-docker build --tag ft_backend --file fastertransformer_backend/Dockerfile .
nvidia-docker run --gpus=all -it --rm --volume $HOME:$HOME --volume $PWD:$PWD -w $PWD --name ft-work ft_backend
# the container starts in the mounted workspace directory because of -w $PWD
export WORKSPACE=$(pwd)
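As a quick sanity check inside the container (not part of the original recipe), the GPUs and the mounted workspace can be verified:
nvidia-smi                # the GPUs should be listed
echo $WORKSPACE           # should print the mounted workspace path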
  • Install libraries for Megatron (optional)
pip install torch regex fire
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
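Optionally, the installation can be verified with a quick import check (a minimal sketch, not part of the original recipe):
python -c "import torch, apex, regex, fire; print(torch.__version__)"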
  • Build FT backend
cd $WORKSPACE
git clone https://github.com/triton-inference-server/server.git
export PATH=/usr/local/mpi/bin:$PATH
source fastertransformer_backend/build.env
mkdir -p fastertransformer_backend/build && cd $WORKSPACE/fastertransformer_backend/build
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 .. && make -j32
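After the build finishes, the shared libraries that the serving step copies later should be present in the build directory; a quick check, assuming the build paths used above:
ls $WORKSPACE/fastertransformer_backend/build/libtriton_transformer.so
ls $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so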
  • Prepare model
git clone https://github.com/NVIDIA/Megatron-LM.git
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir -p models/megatron-models/345m
unzip megatron_lm_345m_v0.0.zip -d models/megatron-models/345m
# run the checkpoint converter that ships with the FasterTransformer sources; _deps/repo-ft-src
# is where the CMake build fetches those sources (the same script also lives at
# sample/pytorch/utils/megatron_ckpt_convert.py in a FasterTransformer checkout)
python _deps/repo-ft-src/sample/pytorch/utils/megatron_ckpt_convert.py -i ./models/megatron-models/345m/release/ -o ./models/megatron-models/c-model/345m/ -t_g 1 -i_g 8
cp ./models/megatron-models/c-model/345m/8-gpu $WORKSPACE/fastertransformer_backend/all_models/transformer/1/ -r
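If the conversion succeeded, the converted 8-GPU checkpoint should now be in place; a quick check of the layout, using the paths from the commands above:
ls ./models/megatron-models/c-model/345m/8-gpu
ls $WORKSPACE/fastertransformer_backend/all_models/transformer/1/8-gpu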

Run Serving

  • Run serving directly
cp $WORKSPACE/fastertransformer_backend/build/libtriton_transformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/transformer
cd $WORKSPACE && ln -s server/qa/common .
# recommended: increase SERVER_TIMEOUT in common/util.sh, since loading a large model can take a while
cd $WORKSPACE/fastertransformer_backend/build/
bash $WORKSPACE/fastertransformer_backend/tools/run_server.sh
bash $WORKSPACE/fastertransformer_backend/tools/run_client.sh
python _deps/repo-ft-src/sample/pytorch/utils/convert_gpt_token.py --out_file=triton_out # used to check the result
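If the client cannot connect, the server's readiness can be checked directly over HTTP (assuming Triton's default HTTP port 8000):
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready   # prints 200 once the model is loaded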
  • Modify the model configuration

The model configuration for the Triton server lives in all_models/transformer/config.pbtxt. Users can modify the following hyper-parameters (an illustrative fragment follows the list):

  • candidate_num: the k value for top-k sampling
  • probability_threshold: the p value for top-p sampling
  • tensor_para_size: size of tensor parallelism
  • layer_para_size: size of layer parallelism
  • layer_para_batch_size: unused in the Triton backend, because this backend only supports a single node, where tensor parallelism is recommended instead
  • max_seq_len: maximum supported sequence length
  • is_half: whether to run in half precision
  • head_num: number of attention heads
  • size_per_head: size per attention head
  • vocab_size: vocabulary size
  • decoder_layers: number of transformer layers
  • batch_size: maximum supported batch size
  • is_fuse_QKV: whether to fuse the Q, K and V projections into one matrix multiplication; this also depends on how the QKV weights are laid out
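For illustration only, the hyper-parameters above appear in config.pbtxt roughly as in the hand-written fragment below; it assumes they are exposed through Triton's parameters map as string values, so the shipped all_models/transformer/config.pbtxt remains the authoritative reference:
parameters {
  key: "tensor_para_size"       # matches the 8-GPU checkpoint converted above
  value: { string_value: "8" }
}
parameters {
  key: "candidate_num"          # illustrative value
  value: { string_value: "1" }
}
parameters {
  key: "is_half"                # illustrative value
  value: { string_value: "1" }
}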
