This is the Triton backend for FasterTransformer. The repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA. As of FasterTransformer v4.0, it supports multi-GPU inference on GPT-3 models. This backend integrates FasterTransformer into Triton so that giant GPT-3 models can be served by Triton. The example below shows how to use the FasterTransformer backend in Triton to run inference on a GPT-3 model with 345M parameters trained by Megatron-LM.
Note that this is a research and prototyping tool, not a formal product or maintained framework. You can learn more about Triton backends in the backend repo. Ask questions or report problems on the issues page of this fastertransformer_backend repo.
- Prepare Machine
We provide a Dockerfile, based on the Triton image nvcr.io/nvidia/tritonserver:21.02-py3, to set up the environment.
mkdir workspace && cd workspace
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
nvidia-docker build --tag ft_backend --file fastertransformer_backend/Dockerfile .
nvidia-docker run --gpus=all -it --rm --volume $HOME:$HOME --volume $PWD:$PWD -w $PWD --name ft-work ft_backend
cd workspace
export WORKSPACE=$(pwd)
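Before continuing, it can help to confirm that the GPUs and MPI are visible inside the container; a minimal sanity check (nothing backend-specific is assumed here):
nvidia-smi
mpirun --version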
- Install libraries for Megatron (optional)
pip install torch regex fire
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
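Optionally, check that apex is importable after the build (a minimal smoke test; it does not exercise the CUDA extensions):
python -c "import apex; print(apex.__file__)"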
- Build FT backend
cd $WORKSPACE
git clone https://github.com/triton-inference-server/server.git
export PATH=/usr/local/mpi/bin:$PATH
source fastertransformer_backend/build.env
mkdir -p fastertransformer_backend/build && cd $WORKSPACE/fastertransformer_backend/build
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 .. && make -j32
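If the build succeeds, the backend libraries used in the serving step below should be present in the build directory; a quick check from inside the build directory:
ls libtriton_transformer.so lib/libtransformer-shared.so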
- Prepare model
git clone https://github.com/NVIDIA/Megatron-LM.git
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir -p models/megatron-models/345m
unzip megatron_lm_345m_v0.0.zip -d models/megatron-models/345m
# Both invocations below run the same checkpoint converter from the FasterTransformer source; use whichever path matches your checkout layout
python ../sample/pytorch/utils/megatron_ckpt_convert.py -i ./models/megatron-models/345m/release/ -o ./models/megatron-models/c-model/345m/ -t_g 1 -i_g 8
python _deps/repo-ft-src/sample/pytorch/utils/megatron_ckpt_convert.py -i ./models/megatron-models/345m/release/ -o ./models/megatron-models/c-model/345m/ -t_g 1 -i_g 8
cp ./models/megatron-models/c-model/345m/8-gpu $WORKSPACE/fastertransformer_backend/all_models/transformer/1/ -r
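If the copy succeeded, the converted 8-GPU checkpoint should now sit under the model version directory; a quick sanity check:
ls $WORKSPACE/fastertransformer_backend/all_models/transformer/1/8-gpu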
- Run serving directly
cp $WORKSPACE/fastertransformer_backend/build/libtriton_transformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/transformer
cd $WORKSPACE && ln -s server/qa/common .
# Recommended: increase SERVER_TIMEOUT in common/util.sh to a longer value
cd $WORKSPACE/fastertransformer_backend/build/
bash $WORKSPACE/fastertransformer_backend/tools/run_server.sh
bash $WORKSPACE/fastertransformer_backend/tools/run_client.sh
python _deps/repo-ft-src/sample/pytorch/utils/convert_gpt_token.py --out_file=triton_out # Used for checking the result
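While the server launched by run_server.sh is running, its readiness can be confirmed through Triton's standard HTTP health endpoint (port 8000 is Triton's default HTTP port; adjust if the server was started with different ports):
curl -v localhost:8000/v2/health/ready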
- Modify the model configuration
The model configuration for the Triton server is in all_models/transformer/config.pbtxt. You can modify the following hyperparameters (an illustrative snippet follows the list):
- candidate_num: the k value for top-k sampling
- probability_threshold: the p value for top-p sampling
- tensor_para_size: size of tensor parallelism
- layer_para_size: size of layer parallelism
- layer_para_batch_size: unused in the Triton backend, because this backend only supports a single node; tensor parallelism is recommended within a single node
- max_seq_len: maximum supported sequence length
- is_half: whether to use half precision
- head_num: number of attention heads
- size_per_head: size per attention head
- vocab_size: vocabulary size
- decoder_layers: number of transformer layers
- batch_size: maximum supported batch size
- is_fuse_QKV: whether to fuse the Q, K and V projections into a single matrix multiplication; this also depends on the QKV weights
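As a rough illustration, Triton model configurations expose backend-specific settings through a parameters map; whether this backend reads the keys above from that map is an assumption here, and the values below are placeholders rather than a verified configuration for the 345M model:
# Hypothetical excerpt of all_models/transformer/config.pbtxt
parameters {
  key: "tensor_para_size"
  value: { string_value: "8" }
}
parameters {
  key: "candidate_num"
  value: { string_value: "4" }
}
parameters {
  key: "max_seq_len"
  value: { string_value: "512" }
}
parameters {
  key: "is_half"
  value: { string_value: "1" }
}
After editing config.pbtxt, restart the server (run_server.sh) so the new settings are picked up.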