Commit 3932c74

Authored by stas00, Reza Yazdani, and jeffra
BLOOM Inference via DeepSpeed-Inference, Accelerate and DeepSpeed-ZeRO (#308)
* hardcode the dtype depending on the model
* change the mp based on the world_size
* remove hardcoded world_size
* add bigscience/bigscience-small-testing
* fixes
* add zero-inference script
* fixes
* fix
* working script
* renames
* fixes
* fix for offline use
* add benchmark
* add benchmark
* update
* cleanup
* update
* msecs
* cleanup
* improve
* fix benchmark, add warmup
* update
* fix; thanks Michael Wyatt
* clarify
* add bloom batch-inference script
* removed the names :-)
* fold the bs functionality from the other script
* fix
* restore do_sample
* dump generate args
* fix
* fix
* support any batchsize
* div by bs
* mul by bs
* add cpu_offload; sync scripts
* wip
* improvements
* fixes
* fixes
* add accelerate script
* fix
* wip
* wip
* stats
* add OnDevice and remove zero-inference (#316)
* wip
* rework generate + benchmark
* figure out the memory map dynamically
* bug fix
* fix ds-zero-inference wrt device
* bug fix
* update
* update
* fix

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
1 parent 0f23a72 commit 3932c74

File tree

5 files changed: +890 −153 lines changed


scripts/inference/README.md

Lines changed: 194 additions & 0 deletions
@@ -1 +1,195 @@
# Inference scripts for BLOOM

## BLOOM Inference solutions

Here are some stats on JeanZay's 8x80GB A100 node w/ 512GB of CPU memory:

All benchmarks perform greedy generation of 100-token outputs:
```
Generate args {'min_length': 100, 'max_length': 100, 'do_sample': False}
```
The inputs are just a few tokens.

Throughput per token in msecs (lower is better):

| project \ bs | 1      | 8     | 16    | 32    | 64   | 128  |
| :----------- | :----- | :---- | :---- | :---- | :--- | :--- |
| accelerate   | 230.38 | 31.78 | 17.84 | 10.89 | oom  | oom  |
| ds-inference | 40.57  | 5.23  |       |       | 2.77 | 0.66 |
| ds-zero      | 283    | 34.88 | oom   | oom   | oom  | oom  |
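These numbers are the total generation wall time divided by the total number of new tokens produced across the whole batch, which is why a larger batch lowers the per-token cost. As a rough sketch of the conversion (the `msecs_per_token` helper below is only illustrative, it is not part of the scripts):

```
def msecs_per_token(total_secs: float, batch_size: int, new_tokens_per_seq: int = 100) -> float:
    """Wall time divided by all new tokens generated across the batch, in msecs."""
    return total_secs / (batch_size * new_tokens_per_seq) * 1000

# e.g. the ds-inference bs=8 run below generates 800 tokens in ~4.24 secs:
print(f"{msecs_per_token(4.241, batch_size=8):.2f} msecs/token")  # ~5.30, close to the 5.23 in the table
```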

Start to ready-to-generate time in secs:

| project \ bs | 1   | 8   | 16  | 32  | 64  | 128 |
| :----------- | :-- | :-- | :-- | :-- | :-- | :-- |
| accelerate   | 121 | 120 | 113 | 118 |     |     |
| ds-inference | 662 | 673 |     |     | 685 | 654 |
| ds-zero      | 462 | 463 |     |     |     |     |

DS-Inference load time (start to ready to generate) will become much faster soon, once we stop relying on ds-zero to instantiate the model on GPU. The plan is to pre-shard the weights TP-wise for 8x and 16x GPUs and load them directly on each GPU, which should bring the load time to under 1 min.

## Deepspeed-Inference

Tensor-Parallelism and efficient fused CUDA kernels:
https://www.deepspeed.ai/tutorials/inference-tutorial/

### Setup

```
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
pip install .
```

### Run

```
deepspeed --num_gpus 8 scripts/inference/bloom-ds-inference.py --name bigscience/bloom
```

Performance on a single node of 8x80GB A100 w/ 512GB CPU RAM (JeanZay), with just a batch of 1 (it would be more efficient to run a larger batch).

Add `--benchmark` to activate the benchmarks.

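For orientation, the core DeepSpeed-Inference recipe the script is built around looks roughly like the sketch below (using the small `bigscience/bigscience-small-testing` checkpoint for illustration). The actual `bloom-ds-inference.py` adds checkpoint handling, dtype selection and benchmarking on top, so treat this only as a minimal sketch:

```
# Minimal DeepSpeed-Inference sketch (launch with: deepspeed --num_gpus 8 this_sketch.py)
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bigscience-small-testing"  # swap in bigscience/bloom on an 8x80GB node
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# shard the model tensor-parallel across the launched ranks and inject the fused CUDA kernels
model = deepspeed.init_inference(
    model,
    mp_size=world_size,               # tensor-parallel degree = number of launched GPUs
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = model.module  # unwrap the inference engine

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
if local_rank == 0:
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```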
BS=1
```
$ deepspeed --num_gpus 8 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-ds-inference_bs=1.txt
[...]
```

Memory usage per process while processing:

- GPU: ~50GB
- CPU: ~10GB

BS=8
```
$ deepspeed --num_gpus 8 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --batch_size 8 --benchmark 2>&1 | tee bloom-ds-inference_bs=8.txt
[...]
*** Performance stats:
Throughput per token including tokenize: 5.23 msecs
Start to ready to generate: 683.397 secs
Tokenize and generate 800 (bs=8) tokens: 4.241 secs
Start to finish: 687.638 secs
```

BS=64
```
$ deepspeed --num_gpus 8 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --batch_size 64 --benchmark 2>&1 | tee bloom-ds-inference_bs=64.txt
```

BS=128
```
$ deepspeed --num_gpus 8 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --batch_size 128 --benchmark 2>&1 | tee bloom-ds-inference_bs=128.txt
```

## Deepspeed ZeRO-Inference

https://www.deepspeed.ai/tutorials/zero/

### Setup

```
pip install deepspeed
```

### Run

Note that the script currently runs the same inputs on all GPUs, but you can run a different stream on each GPU and get `n_gpu` times faster throughput. You can't do that with Deepspeed-Inference.

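For orientation, ZeRO-Inference is regular ZeRO stage 3 without an optimizer, so the script follows the standard Transformers + DeepSpeed-ZeRO pattern sketched below. The exact config values (and the `--cpu_offload` option mentioned in the commit message) may differ in the actual `bloom-ds-zero-inference.py`, so treat this as a minimal sketch:

```
# Minimal ZeRO-stage-3 inference sketch (launch with: deepspeed --num_gpus 8 this_sketch.py)
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

model_name = "bigscience/bigscience-small-testing"  # swap in bigscience/bloom on a real node
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_param_persistence_threshold": 0,
        # uncomment to park the params in CPU RAM and stream them in on demand (cpu offload):
        # "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_batch_size": 1 * world_size,
    "train_micro_batch_size_per_gpu": 1,
}

# must be created *before* from_pretrained so the weights are sharded while loading (zero.Init)
dschf = HfDeepSpeedConfig(ds_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()
model = ds_engine.module

# every rank runs the same prompt here; feeding each rank its own slice of the inputs is what
# gives the n_gpu-times throughput noted above
inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
if local_rank == 0:
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```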
BS=1

```
$ deepspeed --num_gpus 8 scripts/inference/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-ds-zero-inference_bs=1.txt
[...]
*** Performance stats:
Throughput per token including tokenize: 282.93 msecs
Start to ready to generate: 501.871 secs
Tokenize and generate 800 (bs=1) tokens: 226.188 secs
Start to finish: 728.060 secs
```

BS=8

```
$ deepspeed --num_gpus 8 scripts/inference/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --benchmark 2>&1 | tee bloom-ds-zero-inference_bs=8.txt
[...]
*** Performance stats:
Throughput per token including tokenize: 34.57 msecs
Start to ready to generate: 482.132 secs
Tokenize and generate 6400 (bs=8) tokens: 221.236 secs
Start to finish: 703.368 secs
```

BS=16 and higher OOMs

```
$ deepspeed --num_gpus 8 scripts/inference/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 16 --benchmark 2>&1 | tee bloom-ds-zero-inference_bs=16.txt
[...]
OOM
```

## HF Accelerate

https://github.com/huggingface/accelerate

### Setup

```
pip install transformers
```

### Run

BS=1
```
$ python scripts/inference/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-accelerate-inference_bs=1.txt
[...]
```

BS=8
```
$ python scripts/inference/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 8 --benchmark 2>&1 | tee bloom-accelerate-inference_bs=8.txt
[...]
```

BS=16
```
$ python scripts/inference/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 16 --benchmark 2>&1 | tee bloom-accelerate-inference_bs=16.txt
[...]
```
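
The heart of `scripts/inference/bloom-accelerate-inference.py` (the full script is part of this commit and shown below) is a single `from_pretrained` call that hands Accelerate a per-GPU memory budget and lets it split the model across the GPUs; note that `device_map="auto"` also needs the `accelerate` package installed. A condensed sketch:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"
dtype = torch.bfloat16  # bloom checkpoints are bf16

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Accelerate places the layers across all visible GPUs
    max_memory={i: '51GIB' for i in range(torch.cuda.device_count())},  # per-GPU budget; the script derives this automatically
    torch_dtype=dtype,
)

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```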
scripts/inference/bloom-accelerate-inference.py

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
import argparse
import time
import os
import gc
import torch
import math
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", required=False, type=int, help="used by dist launchers")
    parser.add_argument("--name", type=str, help="Name path", required=True)
    parser.add_argument("--batch_size", default=1, type=int, help="batch size")
    parser.add_argument("--benchmark", action="store_true", help="additionally run benchmark")
    parser.add_argument("--greedy", action="store_true")
    parser.add_argument("--top-k", type=int, default=0)
    parser.add_argument("--top-p", type=float, default=0.)

    return parser.parse_args()

def get_max_memory_per_gpu_dict(dtype, model_name):
    """ try to generate the memory map based on what we know about the model and the available hardware """

    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()

    if model_name == "bigscience/bloom" and n_gpus == 8 and torch.cuda.get_device_properties(0).total_memory > 79*2**30:
        # hand crafted optimized memory map for 8x80 setup over BLOOM
        # this works with bs=40
        return {0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB', 4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'}

    try:
        # model_params calculation, as we don't have a model yet to do:
        #model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())

        config = AutoConfig.from_pretrained(model_name)
        h = config.n_embed
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing
        model_params = l*(12*h**2 + 13*h) + v*h + 4*h
    except:
        print(f"The model {model_name} has a broken config file. Please notify the owner")
        raise

    bytes = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes
    # add 5% since weight sizes aren't the same and some GPU may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.05)
    print(f"Estimating {param_memory_per_gpu_in_bytes/2**30:0.2f}GB per gpu for weights")

    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes/2**30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes/2**30:0.2f}GB)")

    return {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}

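# Worked example of the sizing formula above for bigscience/bloom (n_embed=14336, n_layer=70,
# vocab_size=250880):
#   70*(12*14336**2 + 13*14336) + 250880*14336 + 4*14336 ~= 176.2e9 params
# at 2 bytes/param (bf16) that's ~352GB of weights, so the automatic path would estimate roughly
# 43GiB per GPU on an 8-GPU node (incl. the 5% margin) - in the same ballpark as the 51GIB
# per GPU that the hand-crafted 8x80GB map above reserves on GPUs 1-7.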
t_start = time.time()

num_tokens = 100

args = get_args()

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

rank = local_rank

model_name = args.name
if rank == 0:
    print(f"Loading model {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)

# XXX: can't automatically derive dtype via config's `from_pretrained`
dtype = torch.bfloat16 if model_name in ["bigscience/bloom", "bigscience/bigscience-small-testing"] else torch.float16

#print(get_max_memory_per_gpu_dict())

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    max_memory=get_max_memory_per_gpu_dict(dtype, model_name),
    torch_dtype=dtype,
)

if args.benchmark:
    t_ready = time.time()


### Generate

if rank == 0:
    print(f"*** Starting to generate {num_tokens} tokens with bs={args.batch_size}")

input_sentences = [
    "DeepSpeed is a machine learning framework",
    "He is working on",
    "He has a",
    "He got all",
    "Everyone is happy and I can",
    "The new movie that got Oscar this year",
    "In the far far distance from our galaxy,",
    "Peace is the only way"
]

if args.batch_size > len(input_sentences):
    # dynamically extend to support larger bs by repetition
    input_sentences *= math.ceil(args.batch_size / len(input_sentences))

generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False)
#generate_kwargs = dict(max_new_tokens=num_tokens, use_cache=False, do_sample=False)
#generate_kwargs = dict(min_length=num_tokens, max_length=num_tokens, do_sample=False)

if rank == 0:
    print(f"Generate args {generate_kwargs}")

inputs = input_sentences[:args.batch_size]

def generate():
    """ returns a list of zipped inputs, outputs and number of new tokens """

    input_tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)
    for t in input_tokens:
        if torch.is_tensor(input_tokens[t]):
            input_tokens[t] = input_tokens[t].to("cuda:0")

    outputs = model.generate(**input_tokens, **generate_kwargs)

    input_tokens_lengths = [x.shape[0] for x in input_tokens.input_ids]
    output_tokens_lengths = [x.shape[0] for x in outputs]

    total_new_tokens = [o-i for i,o in zip(input_tokens_lengths, output_tokens_lengths)]
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return zip(inputs, outputs, total_new_tokens)

# warmup is a must if measuring speed as it's when all the optimizations are performed
# e.g. on 8x80 a100 the first pass of 100 tokens takes 23sec, and the next one is 4secs
_ = generate()

t_generate_start = time.time()
generated = generate()
t_generate_span = time.time() - t_generate_start
if rank == 0:
    for i,o,_ in generated:
        print(f"{'-'*60}\nin={i}\nout={o}\n")


if args.benchmark:
    torch.cuda.empty_cache()
    gc.collect()

### Benchmark

if args.benchmark:
    if rank == 0:
        print(f"*** Running benchmark")

    # warm up
    for i in range(1):
        _ = generate()
    torch.cuda.synchronize()

    # benchmark
    t0 = time.time()
    cycles = 5
    total_new_tokens_generated = 0
    for i in range(cycles):
        generated = generate()
        total_new_tokens_generated += sum(new_tokens for _,_,new_tokens in generated)
    torch.cuda.synchronize()
    if rank == 0:
        throughput = (time.time() - t0)/(total_new_tokens_generated)
        print(f"""
*** Performance stats:
Throughput per token including tokenize: {throughput*1000:.2f} msecs
Start to ready to generate: {t_ready - t_start:.3f} secs
Tokenize and generate {total_new_tokens_generated} (bs={args.batch_size}) tokens: {t_generate_span:.3f} secs
Start to finish: {t_ready - t_start + t_generate_span:.3f} secs
""")
