@@ -165,7 +165,7 @@ Currently the stats below are calculated based on A100 (single GPU), and we calc
##### Llama

| batch_size | 8 | 16 | 32 |
- | :---------------------: | :----: | :----: | :----: |
+ | :-----------------------: | :------: | :------: | :------: |
| hugging-face torch fp16 | 199.12 | 246.56 | 278.4 |
| colossal-inference | 326.4 | 582.72 | 816.64 |

@@ -174,7 +174,7 @@ Currently the stats below are calculated based on A100 (single GPU), and we calc
#### Bloom

| batch_size | 8 | 16 | 32 |
- | :---------------------: | :----: | :----: | :----: |
+ | :-----------------------: | :------: | :------: | :------: |
| hugging-face torch fp16 | 189.68 | 226.66 | 249.61 |
| colossal-inference | 323.28 | 538.52 | 611.64 |

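The benchmark script behind these A100 numbers is not part of this diff. Purely as a rough sketch of how a single-GPU tokens-per-second figure like the hugging-face torch fp16 baseline could be measured, the snippet below uses plain `transformers`; the model id, prompt length, output length, and batch size are assumptions, not the actual benchmark settings.

```python
# Minimal sketch of a single-GPU fp16 generation-throughput measurement.
# Model id, prompt length, output length, and batch size are illustrative
# assumptions, not the settings used for the tables above.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
batch_size, output_len = 8, 128

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

# Build a dummy batch of identical prompts so no padding is required.
prompt = "Hello, my name is"
inputs = tokenizer([prompt] * batch_size, return_tensors="pt").to("cuda")

# Warm up once so CUDA kernels and caches are initialized before timing.
model.generate(**inputs, max_new_tokens=output_len, do_sample=False)

torch.cuda.synchronize()
start = time.time()
model.generate(**inputs, max_new_tokens=output_len, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

# Throughput in generated tokens per second across the whole batch.
print(f"throughput: {batch_size * output_len / elapsed:.2f} tokens/s")
```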
@@ -187,40 +187,40 @@ We conducted multiple benchmark tests to evaluate the performance. We compared t

#### A10 7b, fp16

- | batch_size(micro_batch size)| 2(1) | 4(2) | 8(4) | 16(8) | 32(8) | 32(16)|
- | :-------------------------: | :---: | :---: | :---: | :---: | :---: | :---: |
- | Pipeline Inference | 40.35 | 77.10| 139.03| 232.70| 257.81| OOM |
- | Hugging Face | 41.43 | 65.30| 91.93 | 114.62| OOM | OOM |
+ | batch_size(micro_batch size) | 2(1) | 4(2) | 8(4) | 16(8) | 32(8) | 32(16) |
+ | :----------------------------: | :-----: | :-----: | :------: | :------: | :------: | :------: |
+ | Pipeline Inference | 40.35 | 77.10 | 139.03 | 232.70 | 257.81 | OOM |
+ | Hugging Face | 41.43 | 65.30 | 91.93 | 114.62 | OOM | OOM |

![ppllama7b](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/pp-a10-llama7b.png)

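In the pipeline tables here, batch_size(micro_batch size) means the full batch is split into smaller micro-batches that flow through the pipeline stages back to back so every stage stays busy. The snippet below is only a conceptual sketch of that split in plain PyTorch, using the sizes from the 16(8) column; it is not the scheduling code used to produce these numbers.

```python
# Conceptual illustration only: splitting a batch of 16 requests into
# micro-batches of 8 for pipeline-parallel execution. The real scheduler
# overlaps micro-batches across pipeline stages; this only shows the split.
import torch

batch_size, micro_batch_size, seq_len = 16, 8, 32
input_ids = torch.randint(0, 32000, (batch_size, seq_len))

micro_batches = torch.split(input_ids, micro_batch_size, dim=0)
print([mb.shape for mb in micro_batches])  # two chunks of shape (8, 32)
```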
#### A10 13b, fp16

- | batch_size(micro_batch size)| 2(1) | 4(2) | 8(4) | 16(4) |
- | :---: | :---: | :---: | :---: | :---: |
- | Pipeline Inference | 25.39 | 47.09 | 83.7 | 89.46 |
- | Hugging Face | 23.48 | 37.59 | 53.44 | OOM |
+ | batch_size(micro_batch size) | 2(1) | 4(2) | 8(4) | 16(4) |
+ | :----------------------------: | :-----: | :-----: | :-----: | :-----: |
+ | Pipeline Inference | 25.39 | 47.09 | 83.7 | 89.46 |
+ | Hugging Face | 23.48 | 37.59 | 53.44 | OOM |

![ppllama13](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/pp-a10-llama13b.png)

#### A800 7b, fp16

- | batch_size(micro_batch size) | 2(1) | 4(2) | 8(4) | 16(8) | 32(16) |
- | :---: | :---: | :---: | :---: | :---: | :---: |
- | Pipeline Inference| 57.97 | 110.13 | 213.33 | 389.86 | 670.12 |
- | Hugging Face | 42.44 | 76.5 | 151.97 | 212.88 | 256.13 |
+ | batch_size(micro_batch size) | 2(1) | 4(2) | 8(4) | 16(8) | 32(16) |
+ | :----------------------------: | :-----: | :------: | :------: | :------: | :------: |
+ | Pipeline Inference | 57.97 | 110.13 | 213.33 | 389.86 | 670.12 |
+ | Hugging Face | 42.44 | 76.5 | 151.97 | 212.88 | 256.13 |

![ppllama7b_a800](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/pp-a800-llama7b.png)

### Quantization LLama

- | batch_size | 8 | 16 | 32 |
- | :---------------------: | :----: | :----: | :----: |
- | auto-gptq | 199.20 | 232.56 | 253.26 |
- | smooth-quant | 142.28 | 222.96 | 300.59 |
- | colossal-gptq | 231.98 | 388.87 | 573.03 |
+ | batch_size | 8 | 16 | 32 |
+ | :-------------: | :------: | :------: | :------: |
+ | auto-gptq | 199.20 | 232.56 | 253.26 |
+ | smooth-quant | 142.28 | 222.96 | 300.59 |
+ | colossal-gptq | 231.98 | 388.87 | 573.03 |

![bloom](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/inference-quant.png)
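auto-gptq, smooth-quant, and colossal-gptq all serve weight-quantized models, but none of their APIs appear in this diff. Purely as a generic illustration of what weight quantization does, here is a symmetric per-tensor int8 round-trip in plain PyTorch; the sizes and scheme are assumptions, and real GPTQ/SmoothQuant pipelines use per-group scales, activation smoothing, and calibration data.

```python
# Generic symmetric int8 weight quantization round-trip, for illustration only.
# This is not auto-gptq, smooth-quant, or colossal-gptq code.
import torch

weight = torch.randn(4096, 4096)              # a dense fp32 weight matrix

scale = weight.abs().max() / 127.0            # one scale for the whole tensor
q_weight = torch.clamp((weight / scale).round(), -128, 127).to(torch.int8)

deq_weight = q_weight.float() * scale         # dequantize before (or during) matmul
print("max abs error:", (weight - deq_weight).abs().max().item())
```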