## BLOOM Inference solutions

Here are some benchmark results on JeanZay's 8x80GB A100 node w/ 512GB of CPU memory:

All benchmarks are doing greedy generation of 100 token outputs:
```
Generate args {'max_length': 100, 'do_sample': False}
```
The input prompt consists of just a few tokens.
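For reference, the measured step is a plain greedy `generate()` call. Below is a minimal sketch of it, assuming `model` and `tokenizer` have already been created by one of the solutions described further down (the real scripts also handle batching, warm-up and per-solution device placement):

```python
# Sketch of the measured step: greedy generation of 100-token outputs.
# `model` and `tokenizer` come from whichever loading solution is used below.
import time

def benchmark_step(model, tokenizer, prompts):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda:0")
    t0 = time.time()
    # greedy decoding, matching the Generate args above
    outputs = model.generate(**inputs, max_length=100, do_sample=False)
    elapsed = time.time() - t0
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    # per-token msecs over the whole batch, roughly what the tables below report
    print(f"{elapsed * 1000 / (new_tokens * len(prompts)):.2f} msecs/token")
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```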

Throughput per token in msecs on 8x80GB GPUs (lower is better):

| project \ bs      |      1 |     8 |    16 |    32 |   64 |  128 |  256 | 512 |
| :---------------- | :----- | :---- | :---- | :---- | :--- | :--- | :--- | :-- |
| accelerate bf16   | 230.38 | 31.78 | 17.84 | 10.89 |  oom |      |      |     |
| accelerate int8   | 286.56 | 40.92 | 22.65 | 13.27 |  oom |      |      |     |
| ds-inference fp16 |  44.02 |  5.70 |  3.01 |  1.68 | 1.00 | 0.69 |  oom |     |
| ds-inference int8 |  89.09 | 11.44 |  5.88 |  3.09 | 1.71 | 1.02 | 0.71 | oom |
| ds-zero           |    283 | 34.88 |   oom |       |      |      |      |     |

Start to ready to generate in secs (mainly loading and data preparation time):

| project                 | secs |
| :---------------------- | :--- |
| accelerate              |  121 |
| ds-inference shard-int8 |   61 |
| ds-inference shard-fp16 |   60 |
| ds-inference unsharded  |  662 |
| ds-zero                 |  462 |

Now let's look at the power of the quantized int8-based models, provided by Deepspeed-Inference and BitsAndBytes, as they require only half the GPU memory of bfloat16/float16 inference.
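As a rough sanity check on those memory claims (my own back-of-the-envelope numbers, not from the benchmark scripts): BLOOM has ~176B parameters, so the weights alone need roughly twice as much memory in 16-bit as in int8:

```python
# Back-of-the-envelope weight-memory estimate; ignores activations, the KV cache
# and framework overhead, so treat the numbers as a lower bound.
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 2**30

bloom_params = 176e9  # BLOOM has ~176B parameters

print(f"bf16/fp16: {weight_memory_gib(bloom_params, 2):.0f} GiB")  # ~328 GiB -> needs 8x80GB
print(f"int8:      {weight_memory_gib(bloom_params, 1):.0f} GiB")  # ~164 GiB -> fits on 4x80GB
```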

Throughput per token in msecs on 4x80GB A100 GPUs (lower is better):

| project \ bs      |      1 |     8 |    16 |   32 |   64 |  128 |
| :---------------- | :----- | :---- | :---- | :--- | :--- | :--- |
| accelerate int8   | 284.15 | 40.14 | 21.97 |  oom |      |      |
| ds-inference int8 | 156.51 | 20.11 | 10.38 | 5.50 | 2.96 |  oom |

To get the benchmark results simply add `--benchmark` to any of the 3 scripts discussed below.


## Deepspeed-Inference

Deepspeed-Inference uses Tensor-Parallelism and efficient fused CUDA kernels:
https://www.deepspeed.ai/tutorials/inference-tutorial/
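For orientation, the core of what the script does is wrap the model with DeepSpeed's inference engine. A minimal sketch of that call, assuming the standard `deepspeed.init_inference` API and leaving out the checkpoint-sharding and meta-tensor handling the real script performs:

```python
# Minimal sketch of the DeepSpeed-Inference wrapping step; the real
# bloom-ds-inference.py additionally streams the checkpoint shards efficiently.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))  # set by the `deepspeed` launcher

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", torch_dtype=torch.bfloat16)

# shard the model tensor-parallel across the GPUs and inject the fused CUDA kernels
model = deepspeed.init_inference(
    model,
    mp_size=world_size,               # tensor-parallel degree = number of GPUs
    dtype=torch.float16,              # or torch.int8 for the quantized checkpoints
    replace_with_kernel_inject=True,  # swap in DeepSpeed's fused inference kernels
)
```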

### Setup

```
pip install deepspeed>=0.7.3
```

### Run

1. The fastest approach is to use a TP-pre-sharded checkpoint, which takes only ~1min to load, compared to ~10min for the non-pre-sharded BLOOM checkpoint:

```
deepspeed --num_gpus 8 scripts/bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16
```

1a. If you want to run the original BLOOM checkpoint, which once loaded runs at the same throughput as the previous solution, the loading itself will take 10-20min:

```
deepspeed --num_gpus 8 scripts/bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom
```

2a. The 8-bit quantized version requires only half the GPU memory of the normal half-precision version:

```
deepspeed --num_gpus 8 scripts/bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8
```

Here we used `microsoft/bloom-deepspeed-inference-int8` and also told the script to run in `int8`.

And of course, just 4x80GB A100 GPUs are now sufficient:

```
deepspeed --num_gpus 4 scripts/bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8
```


## HF Accelerate

HF Accelerate can use naive Pipeline Parallelism to load a huge model over multiple GPUs:
https://github.com/huggingface/accelerate
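A minimal sketch of what this naive pipeline-parallel loading looks like with plain `transformers` + `accelerate` (the real script adds argument parsing, batching and benchmarking around it): `device_map="auto"` spreads the layers over the visible GPUs, and `generate()` then runs them one after another.

```python
# Sketch of accelerate-style loading: the layers are spread across the visible
# GPUs (naive pipeline parallelism), so only one GPU is busy at any given time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",           # let accelerate place the layers on the available GPUs
    torch_dtype=torch.bfloat16,  # bf16 weights, as in the `accelerate bf16` rows above
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(0)  # inputs start on the first GPU
print(tokenizer.decode(model.generate(**inputs, max_length=100, do_sample=False)[0]))
```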

### Setup

```
pip install transformers>=4.21.3 accelerate>=0.12.0
```


### Run

```
python scripts/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-accelerate-inference_bs=1.txt
```

To activate the 8-bit quantized solution first install `bitsandbytes`:

```
pip install bitsandbytes
```

and then add `--dtype int8` to the previous command line:

```
python scripts/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark 2>&1 | tee bloom-int8-accelerate-inference_bs=1.txt
```

If you have more than 4 GPUs you can tell it to use only 4 with:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark 2>&1 | tee bloom-int8-accelerate-inference_bs=1.txt
```
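Under the hood `--dtype int8` presumably switches the model loading over to bitsandbytes' 8-bit path; here is a hedged sketch of that, assuming the standard `load_in_8bit` integration in `transformers` (the actual script may wire this up slightly differently):

```python
# Sketch of 8-bit loading via bitsandbytes through the standard transformers
# integration; linear-layer weights are quantized to int8, halving GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",  # still spread the layers over the visible GPUs
    load_in_8bit=True,  # quantize weights to int8 with bitsandbytes
)
```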


## Deepspeed ZeRO-Inference

https://www.deepspeed.ai/tutorials/zero/

### Setup

```
pip install deepspeed
```


### Run

Note that the script currently runs the same inputs on all GPUs, but you can run a different stream on each GPU, and get `n_gpu` times faster throughput. You can't do that with Deepspeed-Inference.
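To illustrate the last point, a hedged sketch of how each rank could be fed its own slice of prompts so the GPUs act as `n_gpu` independent streams (this is not what the current script does; the helper name is illustrative):

```python
# Sketch: shard the prompts across ranks so every GPU generates different outputs.
# The current bloom-ds-zero-inference.py instead feeds identical inputs to all ranks.
import torch.distributed as dist

def my_prompts(all_prompts):
    if not dist.is_initialized():    # single-process fallback
        return all_prompts
    rank, world = dist.get_rank(), dist.get_world_size()
    return all_prompts[rank::world]  # e.g. rank 0 gets prompts 0, 8, 16, ... with 8 GPUs
```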

```
deepspeed --num_gpus 8 scripts/bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-ds-zero-inference_bs=1.txt
```

You can also try the offloading solutions with just one small GPU, which will take a long time to run, but if you don't have 8 huge GPUs this is as good as it gets.

CPU-Offload (1x GPU):

```
deepspeed --num_gpus 1 scripts/bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --cpu_offload --benchmark 2>&1 | tee bloom-ds-zero-inference-cpu_offload_bs=8.txt
```

NVMe-Offload (1x GPU):

```
deepspeed --num_gpus 1 scripts/bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --nvme_offload_path=/path/to/nvme_offload --benchmark 2>&1 | tee bloom-ds-zero-inference-nvme_offload_bs=8.txt
```

Make sure to adjust `/path/to/nvme_offload` to somewhere you have ~400GB of free space on a fast NVMe drive.
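For reference, a hedged sketch of the kind of ZeRO stage-3 inference config these offload flags roughly correspond to (the field names follow DeepSpeed's ZeRO config schema; the exact values used by the script may differ):

```python
# Sketch of a ZeRO-3 config with parameter offload, roughly what the
# --cpu_offload / --nvme_offload_path flags select inside the script.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",                     # or "cpu" for --cpu_offload
            "nvme_path": "/path/to/nvme_offload", # fast NVMe drive with ~400GB free
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}
# passed to deepspeed.initialize(model=model, config_params=ds_config) before generate()
```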