## BLOOM Inference solutions

Here are some benchmark results on JeanZay's 8x80GB A100 node w/ 512GB of CPU memory:

All benchmarks are doing greedy generation of 100 token outputs:
```
Generate args {'max_length': 100, 'do_sample': False}
```
The input prompt consists of just a few tokens.
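For reference, the measured step is a plain greedy `generate()` call. Below is a minimal sketch of it, assuming `model` and `tokenizer` have already been created by one of the solutions described further down (the real scripts also handle batching, warm-up and per-solution device placement):

```python
# Sketch of the measured step: greedy generation of 100-token outputs.
# `model` and `tokenizer` come from whichever loading solution is used below.
import time

def benchmark_step(model, tokenizer, prompts):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda:0")
    t0 = time.time()
    # greedy decoding, matching the Generate args above
    outputs = model.generate(**inputs, max_length=100, do_sample=False)
    elapsed = time.time() - t0
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    # per-token msecs over the whole batch, roughly what the tables below report
    print(f"{elapsed * 1000 / (new_tokens * len(prompts)):.2f} msecs/token")
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```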

Throughput per token in msecs on 8x80GB GPUs (lower is better):

| project \ bs      |      1 |     8 |    16 |    32 |   64 |  128 |  256 | 512 |
| :---------------- | :----- | :---- | :---- | :---- | :--- | :--- | :--- | :-- |
| accelerate bf16   | 230.38 | 31.78 | 17.84 | 10.89 |  oom |      |      |     |
| accelerate int8   | 286.56 | 40.92 | 22.65 | 13.27 |  oom |      |      |     |
| ds-inference fp16 |  44.02 |  5.70 |  3.01 |  1.68 | 1.00 | 0.69 |  oom |     |
| ds-inference int8 |  89.09 | 11.44 |  5.88 |  3.09 | 1.71 | 1.02 | 0.71 | oom |
| ds-zero           |    283 | 34.88 |   oom |       |      |      |      |     |

Start to ready to generate in secs (mainly loading and data preparation time):

| project                 | secs |
| :---------------------- | :--- |
| accelerate              |  121 |
| ds-inference shard-int8 |   61 |
| ds-inference shard-fp16 |   60 |
| ds-inference unsharded  |  662 |
| ds-zero                 |  462 |

Now let's look at the power of the quantized int8-based models, provided by Deepspeed-Inference and BitsAndBytes, as they require only half the GPU memory of bfloat16/float16 inference.
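As a rough sanity check on those memory claims (my own back-of-the-envelope numbers, not from the benchmark scripts): BLOOM has ~176B parameters, so the weights alone need roughly twice as much memory in 16-bit as in int8:

```python
# Back-of-the-envelope weight-memory estimate; ignores activations, the KV cache
# and framework overhead, so treat the numbers as a lower bound.
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 2**30

bloom_params = 176e9  # BLOOM has ~176B parameters

print(f"bf16/fp16: {weight_memory_gib(bloom_params, 2):.0f} GiB")  # ~328 GiB -> needs 8x80GB
print(f"int8:      {weight_memory_gib(bloom_params, 1):.0f} GiB")  # ~164 GiB -> fits on 4x80GB
```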

Throughput per token in msecs on 4x80GB A100 GPUs (lower is better):

| project \ bs      |      1 |     8 |    16 |   32 |   64 |  128 |
| :---------------- | :----- | :---- | :---- | :--- | :--- | :--- |
| accelerate int8   | 284.15 | 40.14 | 21.97 |  oom |      |      |
| ds-inference int8 | 156.51 | 20.11 | 10.38 | 5.50 | 2.96 |  oom |

To get the benchmark results simply add `--benchmark` to any of the 3 scripts discussed below.


## Deepspeed-Inference

Deepspeed-Inference uses Tensor-Parallelism and efficient fused CUDA kernels:
https://www.deepspeed.ai/tutorials/inference-tutorial/
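For orientation, the core of what the script does is wrap the model with DeepSpeed's inference engine. A minimal sketch of that call, assuming the standard `deepspeed.init_inference` API and leaving out the checkpoint-sharding and meta-tensor handling the real script performs:

```python
# Minimal sketch of the DeepSpeed-Inference wrapping step; the real
# bloom-ds-inference.py additionally streams the checkpoint shards efficiently.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))  # set by the `deepspeed` launcher

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", torch_dtype=torch.bfloat16)

# shard the model tensor-parallel across the GPUs and inject the fused CUDA kernels
model = deepspeed.init_inference(
    model,
    mp_size=world_size,               # tensor-parallel degree = number of GPUs
    dtype=torch.float16,              # or torch.int8 for the quantized checkpoints
    replace_with_kernel_inject=True,  # swap in DeepSpeed's fused inference kernels
)
```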

### Setup

```
pip install deepspeed>=0.7.3
```

### Run

1. The fastest approach is to use a TP-pre-sharded checkpoint, which takes only ~1min to load, compared to ~10min for the non-pre-sharded BLOOM checkpoint:

```
deepspeed --num_gpus 8 scripts/bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16
```

1a. If you want to run the original BLOOM checkpoint, which once loaded runs at the same throughput as the previous solution, the loading itself will take 10-20min:

```
deepspeed --num_gpus 8 scripts/bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom
```

2a. The 8-bit quantized version requires only half the GPU memory of the normal half-precision version:

```
deepspeed --num_gpus 8 scripts/bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8
```

Here we used `microsoft/bloom-deepspeed-inference-int8` and also told the script to run in `int8`.

And of course, just 4x80GB A100 GPUs are now sufficient:

```
deepspeed --num_gpus 4 scripts/bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8
```


## HF Accelerate

HF Accelerate can use naive Pipeline Parallelism to load a huge model over multiple GPUs:
https://github.com/huggingface/accelerate
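A minimal sketch of what this naive pipeline-parallel loading looks like with plain `transformers` + `accelerate` (the real script adds argument parsing, batching and benchmarking around it): `device_map="auto"` spreads the layers over the visible GPUs, and `generate()` then runs them one after another.

```python
# Sketch of accelerate-style loading: the layers are spread across the visible
# GPUs (naive pipeline parallelism), so only one GPU is busy at any given time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",           # let accelerate place the layers on the available GPUs
    torch_dtype=torch.bfloat16,  # bf16 weights, as in the `accelerate bf16` rows above
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(0)  # inputs start on the first GPU
print(tokenizer.decode(model.generate(**inputs, max_length=100, do_sample=False)[0]))
```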

### Setup

```
pip install transformers>=4.21.3 accelerate>=0.12.0
```


### Run

```
python scripts/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-accelerate-inference_bs=1.txt
```

To activate the 8-bit quantized solution first install `bitsandbytes`:

```
pip install bitsandbytes
```

and then add `--dtype int8` to the previous command line:

```
python scripts/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark 2>&1 | tee bloom-int8-accelerate-inference_bs=1.txt
```

If you have more than 4 GPUs you can tell it to use only 4 with:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark 2>&1 | tee bloom-int8-accelerate-inference_bs=1.txt
```
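Under the hood `--dtype int8` presumably switches the model loading over to bitsandbytes' 8-bit path; here is a hedged sketch of that, assuming the standard `load_in_8bit` integration in `transformers` (the actual script may wire this up slightly differently):

```python
# Sketch of 8-bit loading via bitsandbytes through the standard transformers
# integration; linear-layer weights are quantized to int8, halving GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",  # still spread the layers over the visible GPUs
    load_in_8bit=True,  # quantize weights to int8 with bitsandbytes
)
```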


## Deepspeed ZeRO-Inference

https://www.deepspeed.ai/tutorials/zero/

### Setup

```
pip install deepspeed
```


### Run

Note that the script currently runs the same inputs on all GPUs, but you can run a different stream on each GPU, and get `n_gpu` times faster throughput. You can't do that with Deepspeed-Inference.
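To illustrate the last point, a hedged sketch of how each rank could be fed its own slice of prompts so the GPUs act as `n_gpu` independent streams (this is not what the current script does; the helper name is illustrative):

```python
# Sketch: shard the prompts across ranks so every GPU generates different outputs.
# The current bloom-ds-zero-inference.py instead feeds identical inputs to all ranks.
import torch.distributed as dist

def my_prompts(all_prompts):
    if not dist.is_initialized():    # single-process fallback
        return all_prompts
    rank, world = dist.get_rank(), dist.get_world_size()
    return all_prompts[rank::world]  # e.g. rank 0 gets prompts 0, 8, 16, ... with 8 GPUs
```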

```
deepspeed --num_gpus 8 scripts/bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 1 --benchmark 2>&1 | tee bloom-ds-zero-inference_bs=1.txt
```

You can also try the offloading solutions with just one small GPU, which will take a long time to run, but if you don't have 8 huge GPUs this is as good as it gets.

CPU-Offload (1x GPU):

```
deepspeed --num_gpus 1 scripts/bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --cpu_offload --benchmark 2>&1 | tee bloom-ds-zero-inference-cpu_offload_bs=8.txt
```

NVMe-Offload (1x GPU):

```
deepspeed --num_gpus 1 scripts/bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --nvme_offload_path=/path/to/nvme_offload --benchmark 2>&1 | tee bloom-ds-zero-inference-nvme_offload_bs=8.txt
```

Make sure to adjust `/path/to/nvme_offload` to somewhere you have ~400GB of free space on a fast NVMe drive.
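For reference, a hedged sketch of the kind of ZeRO stage-3 inference config these offload flags roughly correspond to (the field names follow DeepSpeed's ZeRO config schema; the exact values used by the script may differ):

```python
# Sketch of a ZeRO-3 config with parameter offload, roughly what the
# --cpu_offload / --nvme_offload_path flags select inside the script.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",                     # or "cpu" for --cpu_offload
            "nvme_path": "/path/to/nvme_offload", # fast NVMe drive with ~400GB free
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}
# passed to deepspeed.initialize(model=model, config_params=ds_config) before generate()
```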