Commit cd597c8

Followup PR for adding generation-server (#339)

* fix grpc
* use functools.partial
* fix ds-inference server
* add support for int8
* update README
* fix bugs

Co-authored-by: Mayank Mishra <[email protected]>

1 parent 479aac3 commit cd597c8

12 files changed: +127, -128 lines


scripts/bloom-inference-server/README.md

Lines changed: 14 additions & 39 deletions
@@ -4,13 +4,7 @@ We support HuggingFace accelerate and DeepSpeed Inference for generation.
 Install required packages:
 
 ```shell
-pip install fastapi uvicorn accelerate huggingface_hub>=0.9.0
-```
-To install [DeepSpeed](https://github.com/microsoft/DeepSpeed):
-```shell
-git clone https://github.com/microsoft/DeepSpeed
-cd DeepSpeed
-CFLAGS="-I$CONDA_PREFIX/include/" LDFLAGS="-L$CONDA_PREFIX/lib/" TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
+pip install fastapi uvicorn accelerate huggingface_hub>=0.9.0 deepspeed>=0.7.3
 ```
 To install [DeepSpeed-MII](https://github.com/microsoft/DeepSpeed-MII):
 ```shell
@@ -19,14 +13,9 @@ cd DeepSpeed-MII
 pip install .
 ```
 
-All the provided scripts are tested on 8 A100 80GB GPUs for BLOOM 176B. These scripts might not work for other models or a different number of GPUs.
-DS inference only supports fp16 for cli and server application. However, for benchmarking, it supports both fp16 and bf16. bf16 support will be added once DeepSpeed adds suitable CUDA kernels for these.
+All the provided scripts are tested on 8 A100 80GB GPUs for BLOOM 176B (fp16/bf16) and 4 A100 80GB GPUs for BLOOM 176B (int8). These scripts might not work for other models or a different number of GPUs.
 
-DS inference is deployed using the DeepSpeed MII library which requires the resharded checkpoints for 8 x Tensor Parallel. The HuggingFace checkpoints can be resharded and cached using the following command:
-```shell
-deepspeed --num_gpus 8 scripts/bloom-inference-server/cache_ds_checkpoints.py --model_name bigscience/bloom --dtype fp16 --save_mp_checkpoint_path <PATH TO DS CACHED MODEL>
-```
-Note: Running the above script will consume ~350 GB of disk space and will take some time (~30 minutes), depending on both the speed of your GPUs and storage.
+DS inference is deployed using the DeepSpeed MII library, which requires checkpoints resharded for 8-way tensor parallelism.
 
 Note: sometimes GPU memory is not freed when the DS inference deployment is shut down. You can free this memory by running:
 ```python
@@ -35,6 +24,10 @@ mii.terminate("ds_inference_grpc_server")
 ```
 or alternatively, by running `killall python` in a terminal.
 
+To use quantized BLOOM, pass dtype int8. For DeepSpeed-Inference, also change model_name to microsoft/bloom-deepspeed-inference-int8; for HF accelerate, model_name does not need to change.
+
+HF accelerate uses [LLM.int8()](https://arxiv.org/abs/2208.07339) and DS-inference uses [ZeroQuant](https://arxiv.org/abs/2206.01861) for post-training quantization.
+
 #### BLOOM inference via command-line
 This asks for generate_kwargs every time.
 Example: generate_kwargs =
@@ -49,7 +42,7 @@ python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dty
 
 2. using DS inference
 ```shell
-python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
+python scripts/bloom-inference-server/cli.py --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
 ```
 
 #### BLOOM server deployment
@@ -60,7 +53,7 @@ python scripts/bloom-inference-server/server.py --model_name bigscience/bloom --
 
 2. using DS inference
 ```shell
-python scripts/bloom-inference-server/server.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --host <HOST ADDRESS> --port <PORT> --allowed_max_new_tokens 100
+python scripts/bloom-inference-server/server.py --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --host <HOST ADDRESS> --port <PORT> --allowed_max_new_tokens 100
 ```
 
 An example [script](examples/server_request.py) to query the BLOOM server is provided. To run this script:
@@ -76,32 +69,14 @@ python scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom
 
 2. using DS inference
 ```shell
-deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --benchmark_cycles 5
+deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
 ```
-
-3. using DS ZeRO
+Alternatively, to load the model faster:
 ```shell
-deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
+deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
 ```
 
-Alternatively, the following shell script will benchmark different batch sizes for the model.
-```shell
-mkdir -p logs
-
-for bs in {1,2,4,8,16,32,64,128}
-do
-python scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5 --batch_size $bs 2>&1 | tee logs/hf-$bs.log
-
-deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --benchmark_cycles 5 --batch_size $bs 2>&1 | tee logs/ds-$bs.log
-
-deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5 --batch_size $bs 2>&1 | tee logs/ds-zero-$bs.log
-done
-```
-
-The following will benchmark sequence length for batch size = 1 on DS inference.
+3. using DS ZeRO
 ```shell
-for sq in {1,10,50,100,200,300,400,500,600,700,800,900,1000,1500,2000,2500,3000,3500,4000,4500,5000}
-do
-deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --batch_size 1 --benchmark_cycles 5 --deployment_framework ds_inference --generate_kwargs '{"do_sample": false, "min_length": '$sq', "max_new_tokens": '$sq'}' 2>&1 | tee logs/ds_$sq.log
-done
+deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
 ```
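The README above references examples/server_request.py without showing its contents. For orientation, a minimal client could look like the sketch below; the `/generate/` endpoint path and the payload keys (`text`, `max_new_tokens`, `do_sample`) are assumptions for illustration, not taken from this commit, so check the actual example script for the real request contract.

```python
# Hypothetical client sketch for the deployed BLOOM server. Endpoint path and
# JSON keys are assumptions; see examples/server_request.py for the real ones.
import requests


def generate(url: str, texts: list, max_new_tokens: int = 100) -> dict:
    payload = {
        "text": texts,
        "max_new_tokens": max_new_tokens,
        "do_sample": False,
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    output = generate("http://<HOST ADDRESS>:<PORT>/generate/",
                      ["DeepSpeed is a machine learning framework"])
    print(output)
```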

scripts/bloom-inference-server/benchmark.py

Lines changed: 8 additions & 7 deletions
@@ -1,6 +1,7 @@
 import argparse
 import gc
 import os
+from functools import partial
 
 import deepspeed
 import torch
@@ -57,14 +58,16 @@ def benchmark_end_to_end(args: argparse.Namespace,
                          model_class: Model,
                          zero_activated: bool = False) -> None:
     model, initialization_time = run_and_log_time(
-        (model_class, {"args": args})
+        partial(model_class, args=args)
     )
 
     request = parse_generate_kwargs(
         get_dummy_batch(args.batch_size),
         args.generate_kwargs
     )
 
+    request.preprocess()
+
     print_rank_n(f"generate_kwargs = {args.generate_kwargs}")
     print_rank_n(f"batch_size = {args.batch_size}")
 
@@ -87,13 +90,11 @@ def benchmark_end_to_end(args: argparse.Namespace,
 
     # benchmark
     total_new_tokens_generated, benchmark_time = run_and_log_time(
-        (
+        partial(
             benchmark_generation,
-            {
-                "model": model,
-                "request": request,
-                "cycles": args.benchmark_cycles
-            }
+            model=model,
+            request=request,
+            cycles=args.benchmark_cycles
        )
    )
 
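The switch from `(callable, kwargs-dict)` tuples to `functools.partial` means `run_and_log_time` only has to invoke what it is handed. Its body is not part of this diff, so the sketch below is an assumption about its shape, included only to show why `partial` works here.

```python
# Sketch of a run_and_log_time compatible with the partial-based calls above.
# The real utils.run_and_log_time is not shown in this commit, so treat this
# as an illustrative assumption rather than the actual implementation.
import time
from functools import partial
from typing import Any, Callable, List, Tuple, Union


def run_and_log_time(execs: Union[Callable, List[Callable]]) -> Tuple[Any, float]:
    """Run one callable (or a list of them) and return (results, elapsed seconds)."""
    start_time = time.time()
    if isinstance(execs, (list, tuple)):
        results = [f() for f in execs]
    else:
        results = execs()
    return results, time.time() - start_time


# usage mirroring the diff:
# model, initialization_time = run_and_log_time(partial(model_class, args=args))
```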

scripts/bloom-inference-server/cli.py

Lines changed: 3 additions & 0 deletions
@@ -55,6 +55,9 @@ def main() -> None:
            continue
 
        request = parse_generate_kwargs([input_text], generate_kwargs)
+
+       request.preprocess()
+
        response = model.generate(request)
 
        print_rank_n("Output text:", response.text[0])

scripts/bloom-inference-server/ds_inference/grpc_server.py

Lines changed: 28 additions & 31 deletions
@@ -6,59 +6,59 @@
 from transformers import AutoTokenizer
 
 import mii
-from utils import GenerateRequest, GenerateResponse, Model, get_filter_dict, get_str_dtype, print_rank_n
+from utils import (
+    GenerateRequest,
+    GenerateResponse,
+    Model,
+    get_downloaded_model_path,
+    get_filter_dict,
+    get_str_dtype,
+    print_rank_n
+)
 
 
 class DSInferenceGRPCServer(Model):
     def __init__(self, args: argparse.Namespace) -> None:
         self.deployment_name = "ds_inference_grpc_server"
 
-        files = os.listdir(args.save_mp_checkpoint_path)
-        for file in files:
-            if (file.endswith(".json")):
-                checkpoints_json = json.load(
-                    open(os.path.join(args.save_mp_checkpoint_path, file), "r"))
-                break
+        downloaded_model_path = get_downloaded_model_path(args.model_name)
 
-        if ("base_dir" in checkpoints_json):
-            del checkpoints_json["base_dir"]
+        self.tokenizer = AutoTokenizer.from_pretrained(downloaded_model_path)
+        self.pad = self.tokenizer.pad_token_id
+
+        if (args.dtype in [torch.float16, torch.int8]):
+            checkpoints_json = os.path.join(
+                downloaded_model_path, "ds_inference_config.json")
 
-        if (args.dtype == torch.float16):
             mii.deploy(
                 task="text-generation",
-                model=args.model_name,
+                # should pass args.model_name but can't since the new
+                # weights are not supported yet. So, this is a hack
+                model="bigscience/bloom",
                 deployment_name=self.deployment_name,
+                model_path=downloaded_model_path,
                 mii_config={
                     "dtype": get_str_dtype(args.dtype),
                     "tensor_parallel": 8,
                     "port_number": 50950,
-                    "checkpoint_dict": checkpoints_json
-                },
-                model_path=args.save_mp_checkpoint_path
+                    "checkpoint_dict": json.load(open(checkpoints_json, "r"))
+                }
             )
-        else:
-            raise NotImplementedError("This is not yet supported")
+        elif (args.dtype == torch.bfloat16):
+            raise NotImplementedError("bfloat16 is not yet supported")
 
-        self.tokenizer = AutoTokenizer.from_pretrained(args.model_name)
-        self.pad = self.tokenizer.pad_token_id
         self.model = mii.mii_query_handle(self.deployment_name)
 
     def generate(self, request: GenerateRequest) -> GenerateResponse:
-        text = request.text
-
-        return_type = type(text)
-        if (return_type == str):
-            text = [text]
-
         output_text = self.model.query(
-            {"query": text},
+            {"query": request.text},
            **get_filter_dict(request)
        ).response
 
        output_text = [_ for _ in output_text]
 
        # Remove input from output
-        input_tokens = self.tokenizer(text).input_ids
+        input_tokens = self.tokenizer(request.text).input_ids
        output_tokens = self.tokenizer(output_text).input_ids
 
        input_token_lengths = [len(x) for x in input_tokens]
@@ -72,10 +72,6 @@ def generate(self, request: GenerateRequest) -> GenerateResponse:
        output_text = self.tokenizer.batch_decode(
            output_tokens, skip_special_tokens=True)
 
-        if (return_type == str):
-            output_text = output_text[0]
-            num_generated_tokens = num_generated_tokens[0]
-
        return GenerateResponse(
            text=output_text,
            num_generated_tokens=num_generated_tokens
@@ -87,4 +83,5 @@ def shutdown(self) -> None:
        try:
            mii.terminate(self.deployment_name)
        except Exception:
-            exit()
+            pass
+        exit()
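The generate method strips the prompt from MII's output by re-tokenizing both the input and the returned string and keeping only the newly generated tail. The slicing itself falls between the hunks shown, so the snippet below is a self-contained illustration of that pattern rather than a copy of the file; the small bloom-560m tokenizer is used only to keep the example runnable.

```python
# Standalone illustration of the "Remove input from output" step in
# DSInferenceGRPCServer.generate: re-tokenize prompt and completion, count how
# many tokens were newly generated, and decode only that tail.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # small stand-in model

prompts = ["DeepSpeed is"]
completions = ["DeepSpeed is a deep learning optimization library."]  # as returned by MII

input_tokens = tokenizer(prompts).input_ids
output_tokens = tokenizer(completions).input_ids

input_token_lengths = [len(x) for x in input_tokens]
output_token_lengths = [len(x) for x in output_tokens]
num_generated_tokens = [o - i for i, o in zip(input_token_lengths, output_token_lengths)]

# keep only the generated tail of each sequence before decoding
generated_tokens = [x[-n:] if n > 0 else [] for x, n in zip(output_tokens, num_generated_tokens)]
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
print(num_generated_tokens)
```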

scripts/bloom-inference-server/ds_inference/model.py

Lines changed: 5 additions & 9 deletions
@@ -3,9 +3,11 @@
 import json
 import os
 from argparse import Namespace
+from functools import partial
 
 import deepspeed
 import torch
+import torch.distributed as dist
 from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
 
 from utils import Model, get_downloaded_model_path, print_rank_n, run_rank_n
@@ -58,6 +60,7 @@ def __init__(self, args: Namespace) -> None:
        self.input_device = torch.cuda.current_device()
 
        print_rank_n("Model loaded")
+        dist.barrier()
 
 
 class TemporaryCheckpointsJSON:
@@ -77,17 +80,10 @@ def write_checkpoints_json(self, model_path: str) -> None:
 
    def __enter__(self):
        run_rank_n(
-            os.makedirs,
-            {
-                "name": self.tmp_directory,
-                "exist_ok": True
-            }
+            partial(os.makedirs, name=self.tmp_directory, exist_ok=True)
        )
        run_rank_n(
-            self.write_checkpoints_json,
-            {
-                "model_path": self.model_path
-            },
+            partial(self.write_checkpoints_json, model_path=self.model_path),
            barrier=True
        )
        return self.tmp_file
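`run_rank_n` now also receives a `functools.partial`. Its implementation is outside this diff; a helper consistent with the calls above might look like the following sketch, where the signature (`func, barrier=False, rank=0`) and behavior are assumptions rather than the repository's actual code.

```python
# Assumed shape of utils.run_rank_n, consistent with how it is called above:
# execute the given callable on a single rank, optionally barrier afterwards.
# The real implementation is not part of this commit.
import torch.distributed as dist


def run_rank_n(func, barrier: bool = False, rank: int = 0):
    if dist.is_initialized():
        output = func() if dist.get_rank() == rank else None
        if barrier:
            dist.barrier()
        return output
    # single-process fallback
    return func()


# usage mirroring TemporaryCheckpointsJSON.__enter__:
# run_rank_n(partial(os.makedirs, name=self.tmp_directory, exist_ok=True))
# run_rank_n(partial(self.write_checkpoints_json, model_path=self.model_path), barrier=True)
```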

scripts/bloom-inference-server/ds_zero/model.py

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,7 @@
 
 import deepspeed
 import torch
+import torch.distributed as dist
 from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
 from transformers.deepspeed import HfDeepSpeedConfig
 
@@ -64,3 +65,4 @@ def __init__(self, args: Namespace) -> None:
        self.input_device = torch.cuda.current_device()
 
        print_rank_n("Model loaded")
+        dist.barrier()
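The only functional change here is the barrier after the load message, presumably to keep faster ranks from running ahead while slower ranks are still materializing their shards. In minimal standalone form (assuming the process group has already been initialized, e.g. by deepspeed):

```python
# Minimal form of the pattern added in this commit: synchronize all ranks once
# the local model is ready, so no rank starts generating before the others.
import torch.distributed as dist

# ... build/load the model on this rank ...
if dist.is_initialized():  # process group assumed to be set up by the launcher
    dist.barrier()         # wait here until every rank has finished loading
```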

scripts/bloom-inference-server/hf_accelerate/model.py

Lines changed: 32 additions & 13 deletions
@@ -15,13 +15,20 @@ def __init__(self, args: Namespace) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained(downloaded_model_path)
        self.pad = self.tokenizer.pad_token_id
 
-        self.model = AutoModelForCausalLM.from_pretrained(
-            downloaded_model_path,
-            device_map="auto",
-            max_memory=get_max_memory_per_gpu_dict(
-                args.dtype, args.model_name),
-            torch_dtype=args.dtype
-        )
+        kwargs = {
+            "pretrained_model_name_or_path": downloaded_model_path,
+            "device_map": "auto",
+            "max_memory": get_max_memory_per_gpu_dict(
+                args.dtype,
+                args.model_name
+            )
+        }
+        if (args.dtype == torch.int8):
+            kwargs["load_in_8bit"] = True
+        else:
+            kwargs["torch_dtype"] = args.dtype
+
+        self.model = AutoModelForCausalLM.from_pretrained(**kwargs)
 
        self.model.requires_grad_(False)
        self.model.eval()
@@ -39,14 +46,20 @@ def get_max_memory_per_gpu_dict(dtype, model_name):
    if model_name == "bigscience/bloom" and n_gpus == 8 and torch.cuda.get_device_properties(0).total_memory > 79*2**30:
        # hand crafted optimized memory map for 8x80 setup over BLOOM
        # this works with bs=40
-        return {0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB', 4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'}
-
+        if (dtype in [torch.bfloat16, torch.float16]):
+            max_memory_per_gpu = {0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB',
+                                  4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'}
+        elif (dtype == torch.int8):
+            max_memory_per_gpu = {0: '0GIB', 1: '26GIB', 2: '26GIB', 3: '26GIB',
+                                  4: '26GIB', 5: '26GIB', 6: '26GIB', 7: '26GIB'}
+        print_rank_n("Max memory per gpu:", max_memory_per_gpu)
+        return max_memory_per_gpu
    try:
        # model_params calculation, as we don't have a model yet to do:
        #model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
 
        config = AutoConfig.from_pretrained(model_name)
-        h = config.n_embed
+        h = config.hidden_size
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing
@@ -56,11 +69,14 @@ def get_max_memory_per_gpu_dict(dtype, model_name):
            f"The model {model_name} has a broken config file. Please notify the owner")
        raise
 
-    bytes = torch.finfo(dtype).bits / 8
+    if (dtype == torch.int8):
+        bytes = 1
+    else:
+        bytes = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes
    # add 5% since weight sizes aren't the same and some GPU may need more memory
    param_memory_per_gpu_in_bytes = int(
-        param_memory_total_in_bytes / n_gpus * 1.05)
+        param_memory_total_in_bytes / n_gpus * 1.10)
    print_rank_n(
        f"Estimating {param_memory_per_gpu_in_bytes/2**30:0.2f}GB per gpu for weights")
 
@@ -72,4 +88,7 @@ def get_max_memory_per_gpu_dict(dtype, model_name):
    raise ValueError(
        f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes/2**30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes/2**30:0.2f}GB)")
 
-    return {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
+    max_memory_per_gpu = {
+        i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
+    print("Max memory per gpu:", max_memory_per_gpu)
+    return max_memory_per_gpu
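The int8 branch and the headroom factor bumped from 1.05 to 1.10 both change the per-GPU estimate. Below is a small worked example of that arithmetic; the BLOOM parameter count is an approximate stand-in for the model_params value that get_max_memory_per_gpu_dict derives from the model config.

```python
# Worked example of the per-GPU weight-memory estimate used above.
import torch


def estimate_param_memory_per_gpu(model_params: int, dtype: torch.dtype, n_gpus: int) -> int:
    bytes_per_param = 1 if dtype == torch.int8 else torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes_per_param
    # 10% headroom, matching the factor bumped from 1.05 to 1.10 in this commit
    return int(param_memory_total_in_bytes / n_gpus * 1.10)


model_params = 176_000_000_000  # ~176B parameters for BLOOM (approximate)
for dtype, n_gpus in [(torch.float16, 8), (torch.int8, 4)]:
    per_gpu = estimate_param_memory_per_gpu(model_params, dtype, n_gpus)
    print(f"{dtype}, {n_gpus} GPUs -> {per_gpu / 2**30:0.2f} GiB per GPU for weights")
```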
