Rough Notes:
BUILD_ARGS="--build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy --build-arg no_proxy"
IMAGE_TAG=oh-textgen-1.22.0-ub24-oh-v1.18.1-deepseek-1.22.0
docker build $BUILD_ARGS -f Dockerfile-1.22.0-ub24-oh-v1.18.1 -t $IMAGE_TAG .
This section shows how to quantize a Hugging Face model from fp32 to fp8 with Intel Gaudi and the Optimum for Intel Gaudi (aka Optimum Habana) library. A simple benchmarking Python script with a related Dockerfile is also provided. Hugging Face pipelines take advantage of the Hugging Face Tasks in transformer models, such as text generation, translation, question answering and more. You can read more about Hugging Face pipelines on their main page.
A Jupyter notebook with fp8 instructions and a Benchmark.py script for easy benchmarking are provided. For learning purposes, the Jupyter notebook also includes instructions for getting started on bare metal. For Gaudi benchmarking, the Benchmark.py script runs Llama2 70b, Llama3.1 8b, Llama3.1 70b, and Llama3.1 405b inside Docker and generates a report comparing performance against the published numbers in Gaudi Model Performance.
Follow Driver Installation to install the Gaudi driver on the system.
Follow the README to set up the environment for the Jupyter notebook.
To use the Dockerfile provided for the sample, follow Docker Installation to set up the habana runtime for Docker images.
To build the optimum-habana-text-gen image from the Dockerfile, run the command below.
docker build --no-cache -t optimum-habana-text-gen:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile .
After the build, run the command below to start a Docker instance. You will land inside the instance, in the text-generation folder.
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=ALL --privileged=true --net=host --ipc=host optimum-habana-text-gen:latest
Note
Hugging Face model files can be large, so we recommend using an external disk as the Hugging Face hub folder.
Set the HF_HOME environment variable to a path on your external disk and mount that disk into the Docker instance.
ex: "-e HF_HOME=/mnt/huggingface -v /mnt:/mnt"
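As a minimal sketch of how the HF_HOME options combine with the run command above, the helper below assembles the full `docker run` argument list. The function name `docker_run_command` and its parameters are hypothetical and exist only for illustration; every flag is copied from this README.

```python
import shlex


def docker_run_command(hf_home=None, mount_root=None,
                       image="optimum-habana-text-gen:latest"):
    """Build the docker run argument list used in this README.

    hf_home: path for HF_HOME inside the container (optional).
    mount_root: host directory mounted at the same path in the container;
    defaults to hf_home when omitted.
    """
    cmd = [
        "docker", "run", "-it",
        "--runtime=habana",
        "-e", "HABANA_VISIBLE_DEVICES=all",
        "-e", "OMPI_MCA_btl_vader_single_copy_mechanism=none",
        "--cap-add=ALL",
        "--privileged=true",
        "--net=host",
        "--ipc=host",
    ]
    if hf_home is not None:
        mount = mount_root or hf_home
        # Export HF_HOME and bind-mount the external disk at the same path.
        cmd += ["-e", f"HF_HOME={hf_home}", "-v", f"{mount}:{mount}"]
    cmd.append(image)  # the image name must come after all options
    return cmd


# Print a shell-ready command line for the example in the note above.
print(shlex.join(docker_run_command("/mnt/huggingface", "/mnt")))
```

The point to take from the sketch is the ordering: the `-e HF_HOME=...` and `-v ...` options go before the image name, because anything after the image name is passed to the container as its command.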
The benchmark script runs all the models with different input lengths, output lengths and batch sizes, and generates a report comparing the results against the published numbers in Gaudi Model Performance.
Different JSON files are provided for different Gaudi software versions, such as 1.19 and 1.20 on Gaudi3. To benchmark a machine with 8 Gaudi3 cards, run the command below inside the Docker instance.
python3 Benchmark.py
To benchmark a machine with 8 Gaudi2 cards, run the command below inside the Docker instance instead.
GAUDI_VER=2 python3 Benchmark.py
To skip the tests for a model, set the corresponding environment variable to 1.
For example, skip the Llama3.1 405B model test with the following command.
skip_llama31_405b=1 python3 Benchmark.py
The supported environment variables for skipping tests are:
skip_llama2_70b, skip_llama31_8b, skip_llama31_70b, skip_llama33_70b, skip_llama31_405b
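When the benchmark is launched from another script rather than typed by hand, the same variables can be set programmatically. The following is a sketch under assumptions: `benchmark_env` and `run_benchmark` are hypothetical helpers, not part of Benchmark.py, and they only reproduce the GAUDI_VER and skip_* settings described above.

```python
import os
import subprocess

# Skip variables listed in this README; setting one to "1" skips that model.
SKIP_VARS = (
    "skip_llama2_70b",
    "skip_llama31_8b",
    "skip_llama31_70b",
    "skip_llama33_70b",
    "skip_llama31_405b",
)


def benchmark_env(gaudi_ver=3, skip=(), base_env=None):
    """Return a copy of the environment with the benchmark variables added.

    gaudi_ver: 3 adds nothing (the plain `python3 Benchmark.py` case);
    2 adds GAUDI_VER=2.
    skip: iterable of names taken from SKIP_VARS.
    """
    env = dict(os.environ if base_env is None else base_env)
    if gaudi_ver == 2:
        env["GAUDI_VER"] = "2"
    elif gaudi_ver != 3:
        raise ValueError("this README only documents Gaudi2 and Gaudi3 runs")
    for name in skip:
        if name not in SKIP_VARS:
            # Catch typos early instead of silently running every model.
            raise ValueError(f"unknown skip variable: {name}")
        env[name] = "1"
    return env


def run_benchmark(gaudi_ver=3, skip=()):
    """Launch Benchmark.py from the current directory inside the container."""
    return subprocess.run(
        ["python3", "Benchmark.py"],
        env=benchmark_env(gaudi_ver, skip),
        check=True,
    )
```

For example, `run_benchmark(gaudi_ver=2, skip=["skip_llama31_405b"])` corresponds to running `GAUDI_VER=2 skip_llama31_405b=1 python3 Benchmark.py` by hand.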
An HTML report is generated in a timestamped folder and looks like the diagram below.
NOTE: There is also a PerfSpect Report with detailed system and Gaudi information.