Description
Hello, we are trying to use SpecInfer to accelerate model inference, but we have run into a performance issue. Specifically, as the batch size increases from 1 to 16, system throughput improves steadily; however, when the batch size reaches 32, throughput drops sharply, which is confusing. Our execution configuration is as follows:
Environment Setup
We use the provided Docker image (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and build from source following the docs (https://flexflow.readthedocs.io/en/latest/).
We test two supported models, Llama2-70B and OPT-13B, on this dataset: https://huggingface.co/datasets/gbharti/finance-alpaca.
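For reference, below is a minimal sketch of how a prompts JSON file can be built from this dataset. The `instruction`/`input` field names are based on the dataset's Alpaca-style schema, and the prompt count and output path are placeholders rather than our exact settings.

```python
# Sketch: build a prompts JSON file from the finance-alpaca dataset.
# The "instruction"/"input" field names assume the dataset's Alpaca-style
# schema; the prompt count and output path are placeholders.
import json
from datasets import load_dataset

dataset = load_dataset("gbharti/finance-alpaca", split="train")

prompts = []
for row in dataset.select(range(32)):  # e.g., as many prompts as the largest batch size
    text = row["instruction"]
    if row.get("input"):
        text = text + "\n" + row["input"]
    prompts.append(text)

with open("prompts.json", "w") as f:
    json.dump(prompts, f)
```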
Test Script
We run model inference with the script below, following the Quickstart guide in the repo.
```python
import flexflow.serve as ff
import argparse
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', type=int)
    parser.add_argument('--memory_per_gpu', type=int)
    parser.add_argument('--zero_copy_memory_per_node', type=int)
    parser.add_argument('--tensor_parallelism_degree', type=int)
    parser.add_argument('--pipeline_parallelism_degree', type=int)
    parser.add_argument('--llm', type=str)
    parser.add_argument('--ssm', type=str)
    parser.add_argument('--prompts_file', type=str)
    parser.add_argument('--max_requests_per_batch', type=int)
    parser.add_argument('--max_seq_length', type=int)
    parser.add_argument('--max_tokens_per_batch', type=int)
    args = parser.parse_args()

    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree
            )

    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (just one in this case)
    ssms = []
    if args.ssm != '':
        ssm_names = args.ssm.split(',')
        for ssm_name in ssm_names:
            ssm = ff.SSM(ssm_name)
            ssms.append(ssm)

    # Create the sampling configs
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config,
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch
                )

    # load prompts
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)

    llm.start_server()
    result = llm.generate(prompts=prompts)
```
Test Results
We run the evaluation on 4 NVIDIA A100 80 GB GPUs connected via NVLink and record the throughput as the batch size increases from 1 to 32. The results are as follows:
| Batch size | Llama2-70B (tokens/s) | OPT-13B (tokens/s) |
|---|---|---|
| 1 | 28.709671931 | 97.12122162 |
| 2 | 52.22124339 | 189.1327599 |
| 4 | 106.9214668 | 362.0640686 |
| 8 | 182.9473744 | 680.4388029 |
| 16 | 322.7966769 | 1188.828348 |
| 32 | 298.8251763 | 437.7545888 |
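For reference, a minimal sketch of how such a throughput number can be measured, i.e. total generated tokens divided by the wall-clock time of the generate call. The `output_tokens` field used for token counting is an assumption about the returned result objects and may need adjusting to the actual API.

```python
# Sketch of a throughput measurement around llm.generate().
# Counting tokens via `output_tokens` is an assumption about the
# result objects; adjust to whatever the API actually returns.
import time

start = time.time()
results = llm.generate(prompts=prompts)
elapsed = time.time() - start

total_tokens = sum(len(r.output_tokens) for r in results)
print(f"throughput: {total_tokens / elapsed:.2f} tokens/s")
```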
Any help in resolving this issue would be appreciated!