
Commit 4a7bb88

[bloom inference scripts] improvements (#345)
* [bloom inference scripts] improvements
* wip
1 parent cd597c8 commit 4a7bb88

File tree

4 files changed: +18 -20 lines

scripts/bloom-inference-scripts/README.md
scripts/bloom-inference-scripts/bloom-accelerate-inference.py
scripts/bloom-inference-scripts/bloom-ds-inference.py
scripts/bloom-inference-scripts/bloom-ds-zero-inference.py


scripts/bloom-inference-scripts/README.md

Lines changed: 5 additions & 3 deletions
@@ -18,8 +18,9 @@ Throughput in msecs on 8x80GB gpus:
 | accelerate int8 | 286.56 | 40.92 | 22.65 | 13.27 | oom | | | |
 | ds-inference fp16 | 44.02 | 5.70 | 3.01 | 1.68 | 1.00 | 0.69 | oom | |
 | ds-inference int8 | 89.09 | 11.44 | 5.88 | 3.09 | 1.71 | 1.02 | 0.71 | oom |
-| ds-zero | 283 | 34.88 | oom | | | | | |
-| | | | | | | | | |
+| ds-zero bf16 | 283 | 34.88 | oom | | | | | |
+
+note: Since DeepSpeed-ZeRO can process multiple generate streams in parallel, its throughput can be further divided by 8 or 16, depending on whether 8 or 16 GPUs were used during generation. And, of course, it means it can process a batch size of 64 in the case of 8x80GB A100s (the table above).
 
 Start to ready to generate in secs (mainly loading and data preparation time):

@@ -39,7 +40,6 @@ Throughput in msecs 4x80GB A100:
 | :---------------- | :----- | :---- | :---- | :---- | :--- | :--- |
 | accelerate int8 | 284.15 | 40.14 | 21.97 | oom | | |
 | ds-inference int8 | 156.51 | 20.11 | 10.38 | 5.50 | 2.96 | oom |
-| | | | | | | |
 
 To get the benchmark results simply add `--benchmark` to any of these 3 scripts discussed below.

@@ -145,6 +145,8 @@ Note that the script currently runs the same inputs on all GPUs, but you can run
 deepspeed --num_gpus 8 scripts/bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-ds-zero-inference_bs=1.txt
 ```
 
+Please remember that with ZeRO the user can generate multiple unique streams at the same time - and thus the overall performance should be the throughput in secs/token divided by the number of participating GPUs - so 8x to 16x faster, depending on whether 8 or 16 GPUs were used!
+
 You can also try the offloading solutions with just one small GPU, which will take a long time to run, but if you don't have 8 huge GPUs this is as good as it gets.
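The ZeRO multi-stream note added above boils down to simple arithmetic: once each rank feeds its own unique prompts, the measured msecs/token can be divided by the number of participating ranks. A minimal sketch of that calculation, reusing the 34.88 msecs/token figure from the table; the helper function is illustrative and is not part of the scripts:

```python
# Back-of-the-envelope arithmetic for the ZeRO multi-stream note above.
# ds-zero bf16 reports ~34.88 msecs/token at batch size 8 on 8x80GB A100s;
# with every rank generating its own unique stream, the effective cost per
# token across all streams is the reported number divided by the GPU count.

def effective_msecs_per_token(reported_msecs: float, num_gpus: int) -> float:
    return reported_msecs / num_gpus

if __name__ == "__main__":
    reported = 34.88  # msecs/token from the README table (bs=8, 8x80GB)
    for gpus in (8, 16):
        print(f"{gpus} GPUs -> {effective_msecs_per_token(reported, gpus):.2f} effective msecs/token")
    # the effective batch size scales the same way: bs=8 per rank * 8 ranks = 64
```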

scripts/bloom-inference-scripts/bloom-accelerate-inference.py

Lines changed: 4 additions & 8 deletions
@@ -163,26 +163,22 @@ def generate():
 
     return zip(inputs, outputs, total_new_tokens)
 
-# warmup is a must if measuring speed as it's when all the optimizations are performed
-# e.g. on 8x80 a100 the first pass of 100 tokens takes 23sec, and the next one is 4secs
-_ = generate()
-
+print_rank0(f"*** Running generate")
 t_generate_start = time.time()
 generated = generate()
 t_generate_span = time.time() - t_generate_start
 for i,o,_ in generated:
     print_rank0(f"{'-'*60}\nin={i}\nout={o}\n")
 
 
+### Benchmark
+
 if args.benchmark:
+    # clear cache / free memory
     torch.cuda.empty_cache()
     gc.collect()
 
-    ### Benchmark
-
-    if args.benchmark:
     print_rank0(f"*** Running benchmark")
-
     # warm up
     for i in range(1):
         _ = generate()
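After this change the warmup pass happens only inside the `--benchmark` block, while the demo generate is simply announced and run once. As a standalone illustration of that warmup-then-measure pattern, here is a hedged sketch; `benchmark()` and its arguments are made up for the example, the real script inlines this logic:

```python
import gc
import time

import torch


def benchmark(do_generate, cycles: int = 5, warmup: int = 1, total_new_tokens: int = 100):
    """Time do_generate() only after a warmup pass, mirroring the --benchmark flow."""
    # clear cache / free memory so earlier allocations don't skew the numbers
    torch.cuda.empty_cache()
    gc.collect()

    # warmup is a must when measuring speed: the first pass pays for one-time
    # optimizations (e.g. ~23s for the first 100 tokens vs ~4s afterwards on 8x80 A100)
    for _ in range(warmup):
        do_generate()

    t_start = time.time()
    for _ in range(cycles):
        do_generate()
    elapsed = time.time() - t_start

    msecs_per_token = elapsed / (cycles * total_new_tokens) * 1000
    print(f"{msecs_per_token:.2f} msecs/token averaged over {cycles} timed cycles")
```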

scripts/bloom-inference-scripts/bloom-ds-inference.py

Lines changed: 2 additions & 0 deletions
@@ -251,8 +251,10 @@ def generate():
 
 # warmup is a must if measuring speed as it's when all the optimizations are performed
 # e.g. on 8x80 a100 the first pass of 100 tokens takes 23sec, and the next one is 4secs
+print_rank0(f"*** Running generate warmup")
 _ = generate()
 
+print_rank0(f"*** Running generate")
 t_generate_start = time.time()
 generated = generate()
 t_generate_span = time.time() - t_generate_start
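The new print_rank0 calls make it easy to see which phase a distributed launch is in. The helper itself is defined earlier in these scripts and, roughly, prints only from rank 0 so messages are not duplicated once per GPU. A sketch of what such a helper can look like; the environment-variable fallback below is an assumption for illustration, not necessarily the script's exact implementation:

```python
import os

# Rank-aware print: only the first process of a deepspeed/torchrun launch
# emits output, so messages like "*** Running generate warmup" appear once
# rather than once per GPU. The rank is read from the launcher environment.
_rank = int(os.getenv("RANK", os.getenv("LOCAL_RANK", "0")))


def print_rank0(*msg):
    if _rank != 0:
        return
    print(*msg)


print_rank0("*** Running generate warmup")  # visible only on rank 0
```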

scripts/bloom-inference-scripts/bloom-ds-zero-inference.py

Lines changed: 7 additions & 9 deletions
@@ -105,7 +105,7 @@ def print_rank0(*msg):
         device="nvme",
         pin_memory=True,
         nvme_path=args.nvme_offload_path,
-        buffer_size=6e8
+        buffer_size=4e9,
     )
 
 dschf = HfDeepSpeedConfig(ds_config)  # this tells from_pretrained to instantiate directly on gpus
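The buffer_size bump above sits inside the ZeRO-3 offload_param block of the DeepSpeed config this script builds. A trimmed, illustrative version of such a config is sketched below; the offload_param keys are standard DeepSpeed ZeRO-3 options, while the surrounding values (batch size, hidden size, NVMe path) are placeholders rather than the script's exact numbers:

```python
# Illustrative ZeRO-3 config with parameters offloaded to NVMe, showing where
# buffer_size lives. Values outside offload_param are placeholders.
model_hidden_size = 14336          # BLOOM-176B hidden size, used to scale ZeRO-3 thresholds
train_batch_size = 8               # placeholder: world_size * per-rank batch size
nvme_offload_path = "/local_nvme"  # placeholder path to a fast local NVMe drive

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_param_persistence_threshold": 10 * model_hidden_size,
        "offload_param": {
            "device": "nvme",
            "pin_memory": True,
            "nvme_path": nvme_offload_path,
            # a larger buffer means fewer, bigger reads when params are fetched from NVMe
            "buffer_size": 4e9,
        },
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "wall_clock_breakdown": False,
}
```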
@@ -130,6 +130,7 @@ def print_rank0(*msg):
 
 if args.benchmark:
     t_ready = time.time()
+    deepspeed.runtime.utils.see_memory_usage('start-of-generate', force=True)
 
 
 ### Generate
@@ -175,25 +176,22 @@ def generate():
 
 # XXX: this is currently doing world_size streams on world_size gpus, so we can feed it different inputs on each! and hence the time can be divided by world_size
 
-# warmup is a must if measuring speed as it's when all the optimizations are performed
-# e.g. on 8x80 a100 the first pass of 100 tokens takes 23sec, and the next one is 4secs
-_ = generate()
-
+print_rank0(f"*** Running generate")
 t_generate_start = time.time()
 pairs = generate()
 t_generate_span = time.time() - t_generate_start
 for i,o,_ in pairs:
     print_rank0(f"{'-'*60}\nin={i}\nout={o}\n")
 
 
+### Benchmark
+
 if args.benchmark:
+    # clear cache / free memory
     torch.cuda.empty_cache()
     gc.collect()
-    deepspeed.runtime.utils.see_memory_usage('end-of-run', force=True)
+    deepspeed.runtime.utils.see_memory_usage('end-of-generate', force=True)
 
-### Benchmark
-
-if args.benchmark:
     print_rank0(f"*** Running benchmark")
 
     # warm up
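The renamed tags above ('start-of-generate' / 'end-of-generate') bracket the generation phase so the two memory reports line up and can be compared directly. A minimal, hedged sketch of that bracketing pattern outside the script, where work() is merely a stand-in for the real generate() call:

```python
import deepspeed
import torch


def work():
    # stand-in for the real generate() call
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return torch.ones(1024, 1024, device=device).sum()


# force=True makes the report print even when memory profiling is otherwise off;
# matching tags before and after the region make the allocated/reserved deltas easy to read
deepspeed.runtime.utils.see_memory_usage("start-of-generate", force=True)
_ = work()
deepspeed.runtime.utils.see_memory_usage("end-of-generate", force=True)
```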
