diff --git a/README.md b/README.md index cb5cc9a0..f5690261 100644 --- a/README.md +++ b/README.md @@ -45,6 +45,7 @@ The [`examples`](./examples) directory contains examples for using the container | --------- | ------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------- | | Vertex AI | [examples/vertex-ai/notebooks/trl-lora-sft-fine-tuning-on-vertex-ai](./examples/vertex-ai/notebooks/trl-lora-sft-fine-tuning-on-vertex-ai) | Fine-tune Gemma 2B with PyTorch Training DLC using SFT + LoRA on Vertex AI | | Vertex AI | [examples/vertex-ai/notebooks/trl-full-sft-fine-tuning-on-vertex-ai](./examples/vertex-ai/notebooks/trl-full-sft-fine-tuning-on-vertex-ai) | Fine-tune Mistral 7B v0.3 with PyTorch Training DLC using SFT on Vertex AI | +| Vertex AI | [examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch](./examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch) | Fine-tune PaliGemma 2 with PyTorch Training DLC on Vertex AI | | GKE | [examples/gke/trl-full-fine-tuning](./examples/gke/trl-full-fine-tuning) | Fine-tune Gemma 2B with PyTorch Training DLC using SFT on GKE | | GKE | [examples/gke/trl-lora-fine-tuning](./examples/gke/trl-lora-fine-tuning) | Fine-tune Mistral 7B v0.3 with PyTorch Training DLC using SFT + LoRA on GKE | @@ -64,6 +65,7 @@ The [`examples`](./examples) directory contains examples for using the container | GKE | [examples/gke/tgi-llama-405b-deployment](./examples/gke/tgi-llama-405b-deployment) | Deploy Llama 3.1 405B with TGI DLC on GKE | | GKE | [examples/gke/tgi-llama-vision-deployment](./examples/gke/tgi-llama-vision-deployment) | Deploy Llama 3.2 11B Vision with TGI DLC on GKE | | GKE | [examples/gke/tgi-deployment](./examples/gke/tgi-deployment) | Deploy Meta Llama 3 8B with TGI DLC on GKE | +| GKE | [examples/gke/deploy-paligemma-2-with-tgi](./examples/gke/deploy-paligemma-2-with-tgi) | Deploy PaliGemma 2 with TGI DLC on GKE | | GKE | [examples/gke/tgi-from-gcs-deployment](./examples/gke/tgi-from-gcs-deployment) | Deploy Qwen2 7B with TGI DLC from GCS on GKE | | GKE | [examples/gke/tei-deployment](./examples/gke/tei-deployment) | Deploy Snowflake's Arctic Embed with TEI DLC on GKE | | Cloud Run | [examples/cloud-run/deploy-gemma-2-on-cloud-run](./examples/cloud-run/deploy-gemma-2-on-cloud-run) | Deploy Gemma2 9B with TGI DLC on Cloud Run | diff --git a/docs/source/resources.mdx b/docs/source/resources.mdx index 6f70c7e6..a47ca866 100644 --- a/docs/source/resources.mdx +++ b/docs/source/resources.mdx @@ -40,6 +40,7 @@ Learn how to use Hugging Face in Google Cloud by reading our blog posts, present - [Fine-tune Gemma 2B with PyTorch Training DLC using SFT + LoRA on Vertex AI](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/vertex-ai/notebooks/trl-lora-sft-fine-tuning-on-vertex-ai) - [Fine-tune Mistral 7B v0.3 with PyTorch Training DLC using SFT on Vertex AI](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/vertex-ai/notebooks/trl-full-sft-fine-tuning-on-vertex-ai) + - [Fine-tune PaliGemma 2 with PyTorch Training DLC on Vertex AI](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch) - Evaluation @@ -54,6 +55,7 @@ Learn how to use Hugging Face in Google Cloud by reading our blog posts, present - [Deploy Llama 3.1 405B with TGI DLC on 
GKE](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/gke/tgi-llama-405b-deployment) - [Deploy Llama 3.2 11B Vision with TGI DLC on GKE](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/gke/tgi-llama-vision-deployment) - [Deploy Meta Llama 3 8B with TGI DLC on GKE](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/gke/tgi-deployment) + - [Deploy PaliGemma 2 with TGI DLC on GKE](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/gke/deploy-paligemma-2-with-tgi) - [Deploy Qwen2 7B with TGI DLC from GCS on GKE](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/gke/tgi-from-gcs-deployment) - [Deploy Snowflake's Arctic Embed with TEI DLC on GKE](https://github.com/huggingface/Google-Cloud-Containers/tree/main/examples/gke/tei-deployment) diff --git a/examples/gke/README.md b/examples/gke/README.md index 8bb0057a..4ff9d0b3 100644 --- a/examples/gke/README.md +++ b/examples/gke/README.md @@ -13,6 +13,7 @@ This directory contains usage examples of the Hugging Face Deep Learning Contain | Example | Title | | ------------------------------------------------------------ | ------------------------------------------------------------- | +| [deploy-paligemma-2-with-tgi](./deploy-paligemma-2-with-tgi) | Deploy PaliGemma 2 with TGI DLC on GKE | | [tei-deployment](./tei-deployment) | Deploy Snowflake's Arctic Embed with TEI DLC on GKE | | [tei-from-gcs-deployment](./tei-from-gcs-deployment) | Deploy BGE Base v1.5 with TEI DLC from GCS on GKE | | [tgi-deployment](./tgi-deployment) | Deploy Meta Llama 3 8B with TGI DLC on GKE | diff --git a/examples/gke/deploy-paligemma-2-with-tgi/README.md b/examples/gke/deploy-paligemma-2-with-tgi/README.md new file mode 100644 index 00000000..f1f75706 --- /dev/null +++ b/examples/gke/deploy-paligemma-2-with-tgi/README.md @@ -0,0 +1,403 @@ +--- +title: Deploy PaliGemma 2 with TGI DLC on GKE +type: inference +--- + +# Deploy PaliGemma 2 with TGI DLC on GKE + +PaliGemma 2 is the latest multilingual vision-language model released by Google. It combines the SigLIP vision model with the Gemma 2 language model, enabling it to process both images and text inputs to generate text outputs for various tasks, including captioning, visual question answering, and object detection. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs, with high performance text generation. Google Kubernetes Engine (GKE) is a fully-managed Kubernetes service in Google Cloud that can be used to deploy and operate containerized applications at scale using Google Cloud infrastructure. + +This example showcases how to deploy Google PaliGemma 2 from the Hugging Face Hub on a GKE Cluster, running a purpose-built container to deploy LLMs and VLMs in a secure and managed environment with the Hugging Face DLC for TGI. Additionally, this example also presents different scenarios or use-cases where PaliGemma 2 can be used. + +## Setup / Configuration + +> [!NOTE] +> Some configuration steps such as the `gcloud`, `kubectl`, and `gke-cloud-auth-plugin` installation are not required if running the example within the Google Cloud Shell, as it already comes with those dependencies installed. It's also automatically logged in with the current account and project selected on Google Cloud. 
+ +Optionally, we recommend you set the following environment variables for convenience, and to avoid duplicating the values elsewhere in the example: + +```bash +export PROJECT_ID=your-project-id +export LOCATION=your-location +export CLUSTER_NAME=your-cluster-name +``` + +### Requirements + +First, you need to install both `gcloud` and `kubectl` in your local machine, which are the command-line tools to interact with Google Cloud and Kubernetes, respectively. + +- To install `gcloud`, follow the instructions at [Cloud SDK Documentation - Install the gcloud CLI](https://cloud.google.com/sdk/docs/install). +- To install `kubectl`, follow the instructions at [Kubernetes Documentation - Install Tools](https://kubernetes.io/docs/tasks/tools/#kubectl). + +Additionally, to use `kubectl` with the GKE Cluster credentials, you also need to install the `gke-gcloud-auth-plugin`, that can be installed with `gcloud` as follows: + +```bash +gcloud components install gke-gcloud-auth-plugin +``` + +> [!NOTE] +> There are other ways to install the `gke-gcloud-auth-plugin` that you can check in the [GKE Documentation - Install kubectl and configure cluster access](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin). + +### Login and API enablement + +Then you need to login into your Google Cloud account and set the project ID to the one you want to use for the deployment of the GKE Cluster. + +```bash +gcloud auth login +gcloud auth application-default login # Required for local development +gcloud config set project $PROJECT_ID +``` + +Once you are logged in, you need to enable the necessary service APIs in Google Cloud, such as the Google Kubernetes Engine API, the Google Container Registry API, and the Google Container File System API, which are necessary for the deployment of the GKE Cluster and the Hugging Face DLC for TGI. + +```bash +gcloud services enable container.googleapis.com +gcloud services enable containerregistry.googleapis.com +gcloud services enable containerfilesystem.googleapis.com +``` + +### PaliGemma 2 gating and Hugging Face access token + +[`google/paligemma2-3b-pt-224`](https://huggingface.co/google/paligemma2-3b-pt-224) is a gated model, as well as the [rest of the official PaliGemma 2 models](https://huggingface.co/collections/google/paligemma-2-release-67500e1e1dbfdd4dee27ba48). In order to use any of them and being able to download the weights, you first need to accept their gating / license in one of the model cards. + +![PaliGemma 2 Gating on the Hugging Face Hub](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/gke/deploy-paligemma-2-with-tgi/model-gating.png) + +Once you have been granted access to the PaliGemma 2 models on the Hub, you need to generate either a fine-grained or a read-access token. A fine-grained token allows you to scope permissions to the desired models, such [`google/paligemma2-3b-pt-224`](https://huggingface.co/google/paligemma2-3b-pt-224), so you can download the weights, and is the recommended option. A read-access token would allow access to all the models your account has access to. To generate access tokens for the Hugging Face Hub you can follow the instructions at [Hugging Face Hub Documentation - User access tokens](https://huggingface.co/docs/hub/en/security-tokens). 
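You can also verify that your account has actually been granted access to the gated repository before continuing. The snippet below is a minimal sketch that assumes the `huggingface_hub` Python SDK is installed and that the token is exported in a (hypothetical) `HF_TOKEN` environment variable:

```python
import os

from huggingface_hub import model_info
from huggingface_hub.utils import GatedRepoError

try:
    # Resolving the model metadata fails for gated repositories until access is granted
    info = model_info("google/paligemma2-3b-pt-224", token=os.environ["HF_TOKEN"])
    print(f"Access granted, latest revision: {info.sha}")
except GatedRepoError:
    print("Access to google/paligemma2-3b-pt-224 has not been granted yet.")
```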
+ +After the access token is generated, the recommended way of setting it is via the Python CLI `huggingface-cli` that comes with the `huggingface_hub` Python SDK, that can be installed as follows: + +```bash +pip install --upgrade --quiet huggingface_hub +``` + +And then login in with the generated access token with read-access over the gated/private model as: + +```bash +huggingface-cli login +``` + +## Create GKE Cluster + +To deploy the GKE Cluster, the "Autopilot" mode will be used as it is the recommended one for most of the workloads, since the underlying infrastructure is managed by Google; meaning that there's no need to create a node pool in advance or set up their ingress. Alternatively, you can also use the "Standard" mode, but that may require more configuration steps and being more aware / knowledgeable of Kubernetes. + +> [!NOTE] +> Before creating the GKE Autopilot Cluster on a different version than the one pinned below, you should read the [GKE Documentation - Optimize Autopilot Pod performance by choosing a machine series](https://cloud.google.com/kubernetes-engine/docs/how-to/performance-pods) page, as not all the Kubernetes versions available on GKE support GPU accelerators (e.g. `nvidia-l4` is not supported on GKE for Kubernetes 1.28.3 or lower). + +```bash +gcloud container clusters create-auto $CLUSTER_NAME \ + --project=$PROJECT_ID \ + --location=$LOCATION \ + --release-channel=stable \ + --cluster-version=1.30 \ + --no-autoprovisioning-enable-insecure-kubelet-readonly-port +``` + +> [!NOTE] +> If you want to change the Kubernetes version running on the GKE Cluster, you can do so, but make sure to check which are the latest supported Kubernetes versions in the location where you want to create the cluster on, with the following command: +> +> ```bash +> gcloud container get-server-config \ +> --flatten="channels" \ +> --filter="channels.channel=STABLE" \ +> --format="yaml(channels.channel,channels.defaultVersion)" \ +> --location=$LOCATION +> ``` +> +> Additionally, note that you can also use the "RAPID" channel instead of the "STABLE" if you require any Kubernetes feature not shipped yet within the latest Kubernetes version released on the "STABLE" channel, even though using the "STABLE" channel is recommended. For more information please visit the [GKE Documentation - Specifying cluster version](https://cloud.google.com/kubernetes-engine/versioning#specifying_cluster_version). + +![GKE Cluster in the Google Cloud Console](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/gke/deploy-paligemma-2-with-tgi/gke-cluster.png) + +## Get GKE Cluster Credentials + +Once the GKE Cluster is created, you need to get the credentials to access it via `kubectl`: + +```bash +gcloud container clusters get-credentials $CLUSTER_NAME --location=$LOCATION +``` + +Then you will be ready to use `kubectl` commands that will be calling the Kubernetes Cluster you just created on GKE. 
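For example, a quick sanity check that the credentials were fetched correctly and that `kubectl` now points at the newly created cluster (note that on an Autopilot cluster the node list may be short or change over time, as nodes are provisioned on demand):

```bash
kubectl config current-context
kubectl get nodes
```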
+ +## Set Hugging Face Secrets on GKE + +As [`google/paligemma2-3b-pt-224`](https://huggingface.co/google/paligemma2-3b-pt-224) is a gated model and requires a Hugging Face Hub access token to download the weights [as mentioned before](#paligemma2-gating-and-hugging-face-access-token), you need to set a Kubernetes secret with the Hugging Face Hub token previously generated, with the following command (assuming that you have the `huggingface_hub` Python SDK installed): + +```bash +kubectl create secret generic hf-secret \ + --from-literal=hf_token=$(python -c "from huggingface_hub import get_token; print(get_token())") \ + --dry-run=client -o yaml | kubectl apply -f - +``` + +Alternatively, even if not recommended, you can also directly set the access token pasting it within the `kubectl` command as follows (make sure to replace that with your own token): + +```bash +kubectl create secret generic hf-secret \ + --from-literal=hf_token=hf_*** \ + --dry-run=client -o yaml | kubectl apply -f - +``` + +![GKE Secret in the Google Cloud Console](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/gke/deploy-paligemma-2-with-tgi/gke-secrets.png) + +More information on how to set Kubernetes secrets in a GKE Cluster check the [GKE Documentation - Specifying cluster version](https://cloud.google.com/secret-manager/docs/secret-manager-managed-csi-component). + +## Deploy TGI on GKE + +Now you can proceed to the Kubernetes deployment of the Hugging Face DLC for TGI, serving the [`google/paligemma2-3b-pt-224`](https://huggingface.co/google/paligemma2-3b-pt-224) model from the Hugging Face Hub. To explore all the models from the Hugging Face Hub that can be served with TGI, you can explore [the models tagged with `text-generation-inference` in the Hub](https://huggingface.co/models?other=text-generation-inference). + +PaliGemma 2 will be deployed from the following Kubernetes Deployment Manifest (including the Service): + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: tgi +spec: + replicas: 1 + selector: + matchLabels: + app: tgi + template: + metadata: + labels: + app: tgi + hf.co/model: google--paligemma2-3b-pt-224 + hf.co/task: text-generation + spec: + containers: + - name: tgi + image: "us-central1-docker.pkg.dev/gcp-partnership-412108/deep-learning-images/huggingface-text-generation-inference-gpu.3.0.1" + # image: "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.3-0.ubuntu2204.py311" + resources: + requests: + nvidia.com/gpu: 1 + limits: + nvidia.com/gpu: 1 + env: + - name: MODEL_ID + value: "google/paligemma2-3b-pt-224" + - name: NUM_SHARD + value: "1" + - name: PORT + value: "8080" + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: hf-secret + key: hf_token + volumeMounts: + - mountPath: /dev/shm + name: dshm + - mountPath: /tmp + name: tmp + volumes: + - name: dshm + emptyDir: + medium: Memory + sizeLimit: 1Gi + - name: tmp + emptyDir: {} + nodeSelector: + cloud.google.com/gke-accelerator: nvidia-l4 +# --- +apiVersion: v1 +kind: Service +metadata: + name: tgi +spec: + selector: + app: tgi + type: ClusterIP + ports: + - protocol: TCP + port: 8080 + targetPort: 8080 +``` + +You can either deploy by copying the content above into a file named `deployment.yaml` and then deploy it with the following command: + +```bash +kubectl apply -f deployment.yaml +``` + +Optionally, if you also want to deploy the Ingress to e.g. 
expose a public IP to access the Service, then you should then copy the following content into a file named `ingress.yaml`: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: tgi + # https://cloud.google.com/kubernetes-engine/docs/concepts/ingress + annotations: + kubernetes.io/ingress.class: "gce" +spec: + rules: + - http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: tgi + port: + number: 8080 +``` + +And, then deploy it with the following command: + +```bash +kubectl apply -f ingress.yaml +``` + +> [!NOTE] +> Alternatively, you can just clone the [`huggingface/Google-Cloud-Containers`](https://github.com/huggingface/Google-Cloud-Containers) repository from GitHub and the apply the configuration including all the Kubernetes Manifests mentioned above as it follows: +> +> ```bash +> git clone https://github.com/huggingface/Google-Cloud-Containers +> kubectl apply -f Google-Cloud-Containers/examples/gke/deploy-paligemma-2-with-tgi/config +> ``` + +![GKE Deployment in the Google Cloud Console](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/gke/deploy-paligemma-2-with-tgi/gke-deployment.png) + +![GKE Deployment Logs in the Google Cloud Console](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/gke/deploy-paligemma-2-with-tgi/gke-deployment-logs.png) + +> [!NOTE] +> The Kubernetes deployment may take a few minutes to be ready, so you can check the status of the pod/s being deployed on the default namespace with the following command: +> +> ```bash +> kubectl get pods +> ``` +> +> Alternatively, you can just wait (700 seconds) for the deployment to be ready with the following command: +> +> ```bash +> kubectl wait --for=condition=Available --timeout=700s deployment/tgi +> ``` + +## Accessing TGI on GKE + +To access the deployed TGI service, you have two options: + +1. Port-forwarding the service +2. Using the ingress (if configured) + +### Port-forwarding + +You can port-forward the deployed TGI service to port 8080 on your local machine using the following command: + +```bash +kubectl port-forward service/tgi 8080:8080 +``` + +This allows you to access the service via `localhost:8080`. + +### Accessing via Ingress + +If you've configured the ingress (as defined in the [`ingress.yaml`](./config/ingress.yaml) file), you can access the service using the external IP of the ingress. Retrieve the external IP with this command: + +```bash +kubectl get ingress tgi -o jsonpath='{.status.loadBalancer.ingress.ip}' +``` + +Finally, to make sure that the service is healthy and reachable via either `localhost` or the ingress IP (depending on how you exposed the service as of the step above), you can send the following `curl` command: + +```bash +curl http://localhost:8080/health +``` + +And that's it, TGI is now reachable and healthy on GKE! + +## Inference with TGI on GKE + +Before sending the `curl` request for inference, you need to note that the PaliGemma variant that you are serving is [`google/paligemma2-3b-pt-224`](https://huggingface.co/google/paligemma2-3b-pt-224) i.e. 
the pre-trained variant, meaning that's not particularly usable out of the box for any task, but just to transfer well to other tasks after the fine-tuning; anyway, it's pre-trained on a set of given tasks following the previous [PaLI: A Jointly-Scaled Multilingual Language-Image Model](https://arxiv.org/abs/2209.06794) works, which are the following and, so on, the supported prompt formats that will work out of the box via the `/generate` endpoint: + +- `caption {lang}`: Simple captioning objective on datasets like WebLI and CC3M-35L +- `ocr`: Transcription of text on the image using a public OCR system +- `answer en {question}`: Generated VQA on CC3M-35L and object-centric questions on OpenImages +- `question {lang} {English answer}`: Generated VQG on CC3M-35L in 35 languages for given English answers +- `detect {thing} ; {thing} ; ...`: Multi-object detection on generated open-world data +- `segment {thing} ; {thing} ; ...`: Multi-object instance segmentation on generated open-world data +- `caption `: Grounded captioning of content within a specified box + +The PaliGemma and PaliGemma 2 models require the BOS token after the images and before the prefix and then `\n` i.e. the line-break, as the separator token from suffix (input) and the prefix (output); which are both automatically included by the `transformers.PaliGemmaProcessor`, meaning that there's no need to provide those explicitly to the `/generate` endpoint in TGI. + +The images should be provided following the Markdown syntax for image rendering i.e. `![]()`, which requires the image URL to be publicly accessible. Alternatively, you can provide images in the request using base64 encoding of the image data. + +This means that the prompt formatting expected on the `/generate` method is either: + +- `![]()` if the image is provided via URL. +- `![](data:image/png;base64,)` if the image is provided using base64 encoding. + +Read more information about the technical details and implementation of PaliGemma on the papers / technical reports released by Google: + +- [PaliGemma: A versatile 3B VLM for transfer](https://arxiv.org/abs/2407.07726) +- [PaliGemma 2: A Family of Versatile VLMs for Transfer](https://arxiv.org/abs/2412.03555) + +> [!NOTE] +> Note that the `/v1/chat/completions` endpoint cannot be used, and will result in a "chat template error not found", as the model is pre-trained and not fine-tuned for chat conversations, and does not have a chat template that can be applied within the `v1/chat/completions` endpoint following the OpenAI OpenAPI specification. + +### Via cURL + +To send a POST request to the TGI service using `cURL`, you can run the following command: + +```bash +curl http://localhost:8080/generate \ + -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)caption en","parameters":{"max_new_tokens":128,"seed":42}}' \ + -H 'Content-Type: application/json' +``` + +| Image | Input | Output | +|------------------------------------------------------------------------------------------------------------|------------|-------------------------------| +| ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png) | caption en | image of a man in a spacesuit | + +### Via Python + +You can install it via pip as `pip install --upgrade --quiet huggingface_hub`, and then run the following snippet to mimic the cURL command above i.e. 
sending requests to the Generate API: + +```python +from huggingface_hub import InferenceClient + +client = InferenceClient("http://localhost:8080", api_key="-") + +generation = client.text_generation( + prompt="![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)caption en", + max_new_tokens=128, + seed=42, +) +``` + +Or, if you don't have a public URL with the image hosted, you can also send the base64 encoding of the image from the image file as it follows: + +```python +import base64 +from huggingface_hub import InferenceClient + +client = InferenceClient("http://localhost:8080", api_key="-") + +with open("/path/to/image.png", "rb") as f: + b64_image = base64.b64encode(f.read()).decode("utf-8") + +generation = client.text_generation( + prompt=f"![](data:image/png;base64,{b64_image})caption en", + max_new_tokens=128, + seed=42, +) +``` + +Both producing the following output: + +```json +{"generated_text": "image of a man in a spacesuit"} +``` + +## Delete GKE Cluster + +Finally, once you are done using TGI on the GKE Cluster, you can safely delete the GKE Cluster to avoid incurring in unnecessary costs. + +```bash +gcloud container clusters delete $CLUSTER_NAME --location=$LOCATION +``` + +Alternatively, you can also downscale the replicas of the deployed pod to 0 in case you want to preserve the cluster, since the default GKE Cluster deployed with GKE Autopilot mode is running just a single `e2-small` instance. + +```bash +kubectl scale --replicas=0 deployment/tgi +``` diff --git a/examples/gke/deploy-paligemma-2-with-tgi/config/deployment.yaml b/examples/gke/deploy-paligemma-2-with-tgi/config/deployment.yaml new file mode 100644 index 00000000..d66b6112 --- /dev/null +++ b/examples/gke/deploy-paligemma-2-with-tgi/config/deployment.yaml @@ -0,0 +1,51 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: tgi +spec: + replicas: 1 + selector: + matchLabels: + app: tgi + template: + metadata: + labels: + app: tgi + hf.co/model: google--paligemma2-3b-pt-224 + hf.co/task: text-generation + spec: + containers: + - name: tgi + image: "us-central1-docker.pkg.dev/gcp-partnership-412108/deep-learning-images/huggingface-text-generation-inference-gpu.3.0.1" + # image: "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.3-0.ubuntu2204.py311" + resources: + requests: + nvidia.com/gpu: 1 + limits: + nvidia.com/gpu: 1 + env: + - name: MODEL_ID + value: google/paligemma2-3b-pt-224 + - name: NUM_SHARD + value: "1" + - name: PORT + value: "8080" + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: hf-secret + key: hf_token + volumeMounts: + - mountPath: /dev/shm + name: dshm + - mountPath: /tmp + name: tmp + volumes: + - name: dshm + emptyDir: + medium: Memory + sizeLimit: 1Gi + - name: tmp + emptyDir: {} + nodeSelector: + cloud.google.com/gke-accelerator: nvidia-l4 diff --git a/examples/gke/deploy-paligemma-2-with-tgi/config/ingress.yaml b/examples/gke/deploy-paligemma-2-with-tgi/config/ingress.yaml new file mode 100644 index 00000000..3f668d01 --- /dev/null +++ b/examples/gke/deploy-paligemma-2-with-tgi/config/ingress.yaml @@ -0,0 +1,18 @@ +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: tgi + # https://cloud.google.com/kubernetes-engine/docs/concepts/ingress + annotations: + kubernetes.io/ingress.class: "gce" +spec: + rules: + - http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: tgi + port: + number: 8080 diff --git 
a/examples/gke/deploy-paligemma-2-with-tgi/config/service.yaml b/examples/gke/deploy-paligemma-2-with-tgi/config/service.yaml new file mode 100644 index 00000000..1dea9865 --- /dev/null +++ b/examples/gke/deploy-paligemma-2-with-tgi/config/service.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: Service +metadata: + name: tgi +spec: + selector: + app: tgi + type: ClusterIP + ports: + - protocol: TCP + port: 8080 + targetPort: 8080 diff --git a/examples/vertex-ai/README.md b/examples/vertex-ai/README.md index 534a404b..f090001f 100644 --- a/examples/vertex-ai/README.md +++ b/examples/vertex-ai/README.md @@ -12,6 +12,7 @@ For Google Vertex AI, we differentiate between the executable Jupyter Notebook e | ---------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | | [notebooks/trl-lora-sft-fine-tuning-on-vertex-ai](./notebooks/trl-lora-sft-fine-tuning-on-vertex-ai) | Fine-tune Gemma 2B with PyTorch Training DLC using SFT + LoRA on Vertex AI | | [notebooks/trl-full-sft-fine-tuning-on-vertex-ai](./notebooks/trl-full-sft-fine-tuning-on-vertex-ai) | Fine-tune Mistral 7B v0.3 with PyTorch Training DLC using SFT on Vertex AI | +| [notebooks/fine-tune-paligemma-2-with-pytorch](./notebooks/fine-tune-paligemma-2-with-pytorch) | Fine-tune PaliGemma 2 with PyTorch Training DLC on Vertex AI | ### Inference Examples diff --git a/examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch/vertex-notebook.ipynb b/examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch/vertex-notebook.ipynb new file mode 100644 index 00000000..3adc8899 --- /dev/null +++ b/examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch/vertex-notebook.ipynb @@ -0,0 +1,925 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5fce5222-97d7-4572-b50e-2fa60c7f4a17", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "id": "e270e405-39b7-4536-a22a-37ba8461f55e", + "metadata": {}, + "source": [ + "# Fine-tune PaliGemma 2 with PyTorch Training DLC on Vertex AI" + ] + }, + { + "cell_type": "markdown", + "id": "524dddf3-c05f-44bf-852e-0eb75efa8010", + "metadata": {}, + "source": [ + "PaliGemma 2 is the latest multilingual vision-language model released by Google. It combines the SigLIP vision model with the Gemma 2 language model, enabling it to process both images and text inputs to generate text outputs for various tasks, including captioning, visual question answering, and object detection. Hugging Face PyTorch Training DLC is a container that comes with all the Hugging Face and PyTorch dependencies required to fine-tune any model ranging from Transformers, Diffusers, and Sentence Transformers installed. Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications.\n", + "\n", + "This example showcases how to fine-tune Google PaliGemma 2 on multiple GPUs using Ray with the Hugging Face PyTorch DLC for Training on GPU with a purpose-built container to train and fine-tune Transformers, Diffusers and Sentence Transformers models." 
+ ] + }, + { + "cell_type": "markdown", + "id": "22f11d5e-ed3c-4515-98a6-ec1a96259a21", + "metadata": {}, + "source": [ + "![PaliGemma 2 on the Hugging Face Hub](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch/model-on-hub.png)" + ] + }, + { + "cell_type": "markdown", + "id": "18d1ebf7-838e-4aff-a765-b7025b339a85", + "metadata": {}, + "source": [ + "## Setup / Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "46cfe893-d160-48f6-8b42-6690efb6af1c", + "metadata": {}, + "source": [ + "> [!NOTE]\n", + "> Some configuration steps such as the `gcloud` installation or logging into Google Cloud are not required when running the example on Google Cloud Shell, as it already comes with `gcloud` installed and logged in with the current account and project selected on Google Cloud. \n", + "\n", + "Optionally, we recommend you set the following environment variables for convenience, and to avoid duplicating the values elsewhere in the example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "de4206cc-01ea-427b-bd64-b907542e55f6", + "metadata": {}, + "outputs": [], + "source": [ + "%env PROJECT_ID=your-project-id\n", + "%env LOCATION=your-location\n", + "%env SECRET_ID=hf_token\n", + "%env SERVICE_ACCOUNT_NAME=your-service-account-name\n", + "%env BUCKET_NAME=your-bucket-name\n", + "%env CONTAINER_URI=us-central1-docker.pkg.dev/gcp-partnership-412108/deep-learning-images/huggingface-pytorch-training-gpu.2.3.1.transformers.4.48.0.py311\n", + "# %env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-48.ubuntu2204.py311" + ] + }, + { + "cell_type": "markdown", + "id": "fd94734a-3097-45e7-8943-57bc599d9281", + "metadata": {}, + "source": [ + "### Requirements" + ] + }, + { + "cell_type": "markdown", + "id": "fc37f08b-1a47-4d83-a642-38f80837e387", + "metadata": {}, + "source": [ + "First, you need to install `gcloud` in your local machine, which is the command-line tool to interact with Google Cloud. To install `gcloud`, follow the instructions at [Cloud SDK Documentation - Install the gcloud CLI](https://cloud.google.com/sdk/docs/install)." + ] + }, + { + "cell_type": "markdown", + "id": "29edd6cc-1240-48b4-9d9b-ed84a8ec50e8", + "metadata": {}, + "source": [ + "### Login and API enablement" + ] + }, + { + "cell_type": "markdown", + "id": "3981bf97-6bfa-472e-a1d7-fef76256e24b", + "metadata": {}, + "source": [ + "Then you need to login into your Google Cloud account and set the project ID to the one you want to use for running the Vertex AI Pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e267292-4373-4f62-b0db-e9c35cceb014", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "gcloud auth login\n", + "gcloud auth application-default login # Required for local development\n", + "gcloud config set project $PROJECT_ID" + ] + }, + { + "cell_type": "markdown", + "id": "443284fe-4e6e-4adb-badf-91d39e8a3b35", + "metadata": {}, + "source": [ + "Once you are logged in, you need to enable the necessary service APIs in Google Cloud for running Vertex AI Pipelines with the Hugging Face PyTorch DLC for Training. 
These include the Vertex AI API (`aiplatform.googleapis.com`), the Identity and Access Management (IAM) API (`iam.googleapis.com`), the Artifact Registry API (`artifactregistry.googleapis.com`), the Cloud Storage API (`storage-api.googleapis.com`), and the Secret Manager API (`secretmanager.googleapis.com`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be39f797-cf17-43d3-be4f-11b7cba9ecc1", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "gcloud services enable aiplatform.googleapis.com\n", + "gcloud services enable iam.googleapis.com\n", + "gcloud services enable artifactregistry.googleapis.com\n", + "gcloud services enable storage-api.googleapis.com\n", + "gcloud services enable secretmanager.googleapis.com" + ] + }, + { + "cell_type": "markdown", + "id": "5a5cbceb-4bac-498b-8fb8-cf93de32484e", + "metadata": {}, + "source": [ + "### (Optional) Google Cloud Storage (GCS) bucket creation" + ] + }, + { + "cell_type": "markdown", + "id": "03c82f31-ce3e-46d5-9d86-720251cf0cd0", + "metadata": {}, + "source": [ + "If the Google Cloud Storage (GCS) bucket is not created yet, you can different approaches to create it, find all the alternatives listed in the [Google Cloud Storage Documentation - Create a bucket](https://cloud.google.com/storage/docs/creating-buckets). In this case, for simplicity, the `gcloud` CLI will be used, but you can use any of the different alternatives." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e110303a-51cc-473b-8c63-c2b71d395317", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "if gsutil ls -b gs://$BUCKET_NAME &>/dev/null; then\n", + " echo \"Bucket gs://$BUCKET_NAME already exists.\"\n", + "else\n", + " gcloud storage buckets create gs://$BUCKET_NAME \\\n", + " --project=$PROJECT_ID \\\n", + " --location=$LOCATION \\\n", + " --default-storage-class=STANDARD \\\n", + " --uniform-bucket-level-access\n", + "\n", + " if [ $? -eq 0 ]; then\n", + " echo \"Bucket gs://$BUCKET_NAME created successfully.\"\n", + " else\n", + " echo \"Failed to create bucket gs://$BUCKET_NAME.\"\n", + " fi\n", + "fi" + ] + }, + { + "cell_type": "markdown", + "id": "12d9e445-e3b2-40fd-9651-344efadb14ef", + "metadata": {}, + "source": [ + "### PaliGemma 2 gating and Hugging Face access token" + ] + }, + { + "cell_type": "markdown", + "id": "901393a8-e6f2-46a5-a773-42350ee1e9a5", + "metadata": {}, + "source": [ + "[`google/paligemma2-3b-pt-448`](https://huggingface.co/google/paligemma2-3b-pt-448) is a gated model, as well as the [rest of the official PaliGemma 2 models](https://huggingface.co/collections/google/paligemma-2-release-67500e1e1dbfdd4dee27ba48). In order to use any of them and being able to download the weights, you first need to accept their gating / license in one of the model cards.\n", + "\n", + "![PaliGemma 2 Gating on the Hugging Face Hub](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch/model-gating.png)\n", + "\n", + "Once you have been granted access to the PaliGemma 2 models on the Hub, you need to generate either a fine-grained or a read-access token. A fine-grained token allows you to scope permissions to the desired models, such [`google/paligemma2-3b-pt-448`](https://huggingface.co/google/paligemma2-3b-pt-448), so you can download the weights, and is the recommended option. A read-access token would allow access to all the models your account has access to. 
To generate access tokens for the Hugging Face Hub you can follow the instructions at [Hugging Face Hub Documentation - User access tokens](https://huggingface.co/docs/hub/en/security-tokens).\n", + "\n", + "After the access token is generated, the recommended way of setting it is via the Python CLI `huggingface-cli` that comes with the `huggingface_hub` Python SDK, that can be installed as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "59c5e043-7164-4eaf-bbb1-34d06d3db761", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install --upgrade --quiet huggingface_hub" + ] + }, + { + "cell_type": "markdown", + "id": "3f1506d5-328f-45d3-9dfb-7d64f8ccf0f2", + "metadata": {}, + "source": [ + "And then login in with the generated access token with read-access over the gated/private model as:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e366df6d-ede0-48c8-9ca8-3b2ed6489d09", + "metadata": {}, + "outputs": [], + "source": [ + "from huggingface_hub import notebook_login\n", + "\n", + "notebook_login()" + ] + }, + { + "cell_type": "markdown", + "id": "bd4a8f66-b741-48d8-a7d3-634942d0c2ae", + "metadata": {}, + "source": [ + "Finally, you will need to set it as a secret on Google Cloud's Secret Manager as that value will later be pulled by the Vertex Pipeline when accessing / reading the model artifacts in a secure way, as otherwise the token would be exposed as an argument to the Vertex Pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a77469c-bd3c-4b19-862a-af6d49529f0b", + "metadata": {}, + "outputs": [], + "source": [ + "!python -c \"from huggingface_hub import get_token; print(get_token(), end='')\" | gcloud secrets versions add $SECRET_NAME --data-file=-" + ] + }, + { + "cell_type": "markdown", + "id": "33176be4-3955-46bb-8564-31c6a274f373", + "metadata": {}, + "source": [ + "Or just echo the generated token as it follows (both options are secure as long as the instance where you are running this command from is private and only authorized people have access to it, otherwise the token may be leaked within the bash history too):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f265c58-0ec5-4e61-af22-b6398c5cdd4a", + "metadata": {}, + "outputs": [], + "source": [ + "!echo -n \"hf_***\" | gcloud secrets versions add $SECRET_NAME --data-file=-" + ] + }, + { + "cell_type": "markdown", + "id": "07a3ff8f-d1cb-4628-9a18-777dd83351fc", + "metadata": {}, + "source": [ + "### Service Account for Vertex AI" + ] + }, + { + "cell_type": "markdown", + "id": "3ac0dce6-e901-4176-9c07-58be587bdd78", + "metadata": {}, + "source": [ + "Finally, you will need to create a Service Account for Vertex AI with the following default permissions:\n", + "\n", + "- Vertex AI Administrator (`roles/aiplatform.admin`): Provides full control over Vertex AI resources, including creating and managing machine learning models and pipelines.\n", + "- Service Account User (`roles/iam.serviceAccountUser`): Allows the service account to impersonate other service accounts, which may be necessary for certain Vertex AI operations.\n", + "- Vertex AI Service Agent (`roles/aiplatform.serviceAgent`): Grants permissions required for Vertex AI to interact with other Google Cloud services on behalf of the user.\n", + "\n", + "And with the following additional permissions, specific to the current pipeline:\n", + "\n", + "- Artifact Registry Reader (`roles/artifactregistry.reader`): Allows the service account to read 
container images from Artifact Registry.\n", + "- Storage Object Creator (`roles/storage.objectCreator`): Grants access to create objects in the specified Google Cloud Storage bucket, for storing model artifacts and datasets.\n", + "- Secret Manager Secret Accessor (`roles/secretmanager.secretAccessor`): Enables the service account to access the specified secret which is the Hugging Face token.\n", + "\n", + "These permissions enable the service account to manage Vertex AI resources, access necessary data and secrets, and interact with related Google Cloud services as required for this specific pipeline implementation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08a00d5d-9a64-48f0-938e-707623ee05e8", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "set -ex\n", + "\n", + "SERVICE_ACCOUNT_EMAIL=\"${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com\"\n", + "\n", + "# Remove the service account if it already exists\n", + "if gcloud iam service-accounts describe $SERVICE_ACCOUNT_EMAIL --project=$PROJECT_ID &>/dev/null; then\n", + " gcloud iam service-accounts delete $SERVICE_ACCOUNT_EMAIL --project=$PROJECT_ID --quiet\n", + "fi\n", + "\n", + "gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME \\\n", + " --display-name=\"Vertex Pipeline Runner\" \\\n", + " --project=$PROJECT_ID\n", + "\n", + "while ! gcloud iam service-accounts describe $SERVICE_ACCOUNT_EMAIL --project=$PROJECT_ID &>/dev/null; do\n", + " echo \"Waiting for service account to be ready...\"\n", + " sleep 5\n", + "done\n", + "\n", + "gcloud projects add-iam-policy-binding $PROJECT_ID \\\n", + " --member=\"serviceAccount:$SERVICE_ACCOUNT_EMAIL\" \\\n", + " --role=\"roles/aiplatform.admin\"\n", + "\n", + "gcloud projects add-iam-policy-binding $PROJECT_ID \\\n", + " --member=\"serviceAccount:$SERVICE_ACCOUNT_EMAIL\" \\\n", + " --role=\"roles/iam.serviceAccountUser\"\n", + "\n", + "gcloud projects add-iam-policy-binding $PROJECT_ID \\\n", + " --member=\"serviceAccount:$SERVICE_ACCOUNT_EMAIL\" \\\n", + " --role=\"roles/aiplatform.serviceAgent\"\n", + "\n", + "gcloud projects add-iam-policy-binding $PROJECT_ID \\\n", + " --member=\"serviceAccount:$SERVICE_ACCOUNT_EMAIL\" \\\n", + " --role=\"roles/artifactregistry.reader\"\n", + "\n", + "gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \\\n", + " --member=\"serviceAccount:$SERVICE_ACCOUNT_EMAIL\" \\\n", + " --role=\"roles/storage.objectCreator\"\n", + "\n", + "gcloud secrets add-iam-policy-binding $SECRET_NAME \\\n", + " --member=\"serviceAccount:$SERVICE_ACCOUNT_EMAIL\" \\\n", + " --role=\"roles/secretmanager.secretAccessor\" \\\n", + " --project=$PROJECT_ID" + ] + }, + { + "cell_type": "markdown", + "id": "fecf6da5-f8c0-4f25-a3ab-61473c42b6c6", + "metadata": {}, + "source": [ + "## Define Vertex Pipeline" + ] + }, + { + "cell_type": "markdown", + "id": "19ed106e-3341-4484-9198-2837017ceb4b", + "metadata": {}, + "source": [ + "Once everything's configured, you can define the Kubeflow Pipeline that will run on Vertex AI. 
In this case, for fine-tuning a LoRA adapter for the PaliGemma 2 pre-trained model, Transformers will be used in combination with Ray for distributed fine-tuning across multiple GPUs (4 x NVIDIA L4), and the dataset will be a subset of [VQAv2](https://huggingface.co/datasets/HuggingFaceM4/VQAv2), a visual question-answering dataset.\n", + "\n", + "> [!NOTE]\n", + "> The fine-tuning script has been partially ported from [`Fine_tune_PaliGemma.ipynb` in `merveenoyan/smol-vision`](https://github.com/merveenoyan/smol-vision/blob/main/Fine_tune_PaliGemma.ipynb), but has been adapted into a multi-stage Kubeflow Pipeline where each step of the process has been broken down into a Kubeflow Component.\n", + "\n", + "The idea of breaking down the training on different components is mainly for reusability (if caching is enabled), tracing, intermediate artifact storage, and resource management; as you don't want to use the same instance for e.g. downloading artifacts over an HTTP connection than the instance you want to use for the actual fine-tuning which is a heavier workload. Find all the components below:" + ] + }, + { + "cell_type": "markdown", + "id": "9cfc5ca4-f865-4a2a-831a-ce1ca2e1eddb", + "metadata": {}, + "source": [ + "### 1. Download Dataset from Hugging Face Hub" + ] + }, + { + "cell_type": "markdown", + "id": "45706424-90fe-4048-95af-51eeb4b634be", + "metadata": {}, + "source": [ + "To download the dataset from the Hugging Face Hub, you need to define a component that pulls the dataset with `datasets.load_dataset`, then splits it into train and test, and saves the output to disk; assuming that the disk is a remote mount pointing to the previously created bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a25cf90f-ce54-47c1-94c6-8ac054748524", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from typing import Optional\n", + "from kfp.dsl import Dataset, Output, component\n", + "\n", + "@component(base_image=os.getenv(\"CONTAINER_URI\"))\n", + "def download_dataset_from_hub(\n", + " dataset_id: str, dataset: Output[Dataset], split: Optional[str] = None, test_size: Optional[float] = 0.1,\n", + ") -> None:\n", + " from datasets import load_dataset\n", + "\n", + " ds = load_dataset(dataset_id, split=split)\n", + " ds = ds.train_test_split(test_size=test_size) # type: ignore\n", + " ds.save_to_disk(dataset.path)" + ] + }, + { + "cell_type": "markdown", + "id": "e7135919-b173-4b63-98b6-30dbd54bb929", + "metadata": {}, + "source": [ + "### 2. Download Model from Hugging Face Hub" + ] + }, + { + "cell_type": "markdown", + "id": "99a7905a-77d6-43f6-8705-3c21f04a09ea", + "metadata": {}, + "source": [ + "As previously mentioned, the PaliGemma 2 weights are gated, meaning that you need to pull the Hugging Face Hub token first from the Secret Manager, and then with that token download the weights with `huggingface_hub.snapshot_download`. Since the model directory is a remote mount, to speed things up instead of writing the files as you download them from the Hub into the remote mount (which will be slow), you can pull those locally first and then move those into the remote mount, which will be way faster." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "64cf3b83-9628-4a14-9a7a-59d9a42268b6", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from typing import Optional\n", + "from kfp.dsl import Output, Model, component\n", + "\n", + "@component(\n", + " base_image=os.getenv(\"CONTAINER_URI\"),\n", + " packages_to_install=[\"google-cloud-secret-manager\"],\n", + ")\n", + "def download_model_from_hub(\n", + " pretrained_model_name_or_path: str,\n", + " base_model: Output[Model],\n", + " project_id: Optional[str] = None,\n", + " secret_id: Optional[str] = None,\n", + " version_id: Optional[str] = \"latest\",\n", + ") -> None:\n", + " \"\"\"This function downloads the model from the Hugging Face Hub into the local storage and then moves it to\n", + " the `output_model` path which is the path in the mounted bucket, as otherwise downloading directly into the\n", + " remote mount is slower because it involves network latency and overhead for each write operation, whereas\n", + " downloading to local disk first leverages faster local I/O speeds and then allows for optimized bulk transfer\n", + " to the remote storage.\n", + " \"\"\"\n", + " \n", + " import os\n", + " import shutil\n", + " import tempfile\n", + " \n", + " from google.cloud import secretmanager\n", + " from huggingface_hub import snapshot_download\n", + "\n", + " token = None\n", + " if project_id is not None and secret_id is not None:\n", + " client = secretmanager.SecretManagerServiceClient()\n", + " secret_name = f\"projects/{project_id}/secrets/{secret_id}/versions/{version_id}\"\n", + " response = client.access_secret_version(request={\"name\": secret_name})\n", + " token = response.payload.data.decode(\"UTF-8\")\n", + "\n", + " os.environ[\"HF_HUB_ENABLE_HF_TRANSFER\"] = \"1\"\n", + " with tempfile.TemporaryDirectory() as temp_dir:\n", + " snapshot_download(\n", + " repo_id=pretrained_model_name_or_path,\n", + " repo_type=\"model\",\n", + " token=token,\n", + " local_dir=temp_dir,\n", + " )\n", + " \n", + " shutil.copytree(temp_dir, base_model.path, dirs_exist_ok=True)" + ] + }, + { + "cell_type": "markdown", + "id": "99d21026-706a-41fc-97c6-a3db73cdbf72", + "metadata": {}, + "source": [ + "### 3. Distributed LoRA Fine-Tuning with Ray" + ] + }, + { + "cell_type": "markdown", + "id": "c0c53898-ed31-4ca8-bf17-bff81207f5ee", + "metadata": {}, + "source": [ + "At this stage, both the dataset and the base model have been downloaded and are stored within a bucket, and both artifacts are provided as an input to the fine-tuning component, meaning that there's a dependency on the graph, so that for the fine-tuning to start both artifacts need to be downloaded successfully. In this component, you will need to install `ray[train]` to leverage Ray within the fine-tuning script for distributed fine-tuning across multiple GPUs (4 x NVIDIA L4). The `train_func` i.e. the fine-tuning function, uses `peft` to create the \"fine-tuneable\" LoRA adapter and the `transformers.Trainer` to fine-tune it, and finally the model is saved into the artifact path for the output model i.e. written into the bucket." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dbf6b0ec-3e32-4b6a-9726-b847fc13eccc", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from kfp.dsl import Dataset, Input, Output, Model, component\n", + "\n", + "@component(\n", + " base_image=os.getenv(\"CONTAINER_URI\"),\n", + " packages_to_install=[\"ray[train]\"],\n", + ")\n", + "def fine_tune_model(\n", + " dataset: Input[Dataset],\n", + " base_model: Input[Model],\n", + " hparams: dict,\n", + " fine_tuned_adapter: Output[Model],\n", + ") -> None:\n", + " import ray\n", + " from ray.train import ScalingConfig\n", + " from ray.train.torch import TorchTrainer\n", + "\n", + " def train_func(config: dict) -> None:\n", + " from typing import Any, Dict, List\n", + "\n", + " import torch\n", + " from datasets import load_from_disk\n", + " from peft import get_peft_model, LoraConfig\n", + " from transformers import (\n", + " PaliGemmaForConditionalGeneration,\n", + " PaliGemmaProcessor,\n", + " Trainer,\n", + " TrainingArguments,\n", + " )\n", + "\n", + " ds = load_from_disk(dataset.path)\n", + "\n", + " model = PaliGemmaForConditionalGeneration.from_pretrained(\n", + " base_model.path,\n", + " torch_dtype=torch.bfloat16,\n", + " _attn_implementation=\"eager\",\n", + " ).to(\"cuda\") # type: ignore\n", + "\n", + " lora_config = LoraConfig(\n", + " r=8,\n", + " target_modules=[\n", + " \"q_proj\",\n", + " \"o_proj\",\n", + " \"k_proj\",\n", + " \"v_proj\",\n", + " \"gate_proj\",\n", + " \"up_proj\",\n", + " \"down_proj\",\n", + " ],\n", + " task_type=\"CAUSAL_LM\",\n", + " )\n", + "\n", + " model = get_peft_model(model, lora_config)\n", + "\n", + " processor = PaliGemmaProcessor.from_pretrained(base_model.path)\n", + "\n", + " def collate_fn(examples: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:\n", + " texts = [\"answer en \" + example[\"question\"] for example in examples]\n", + " labels = [example[\"multiple_choice_answer\"] for example in examples]\n", + " images = [example[\"image\"].convert(\"RGB\") for example in examples]\n", + " tokens = processor(\n", + " text=texts,\n", + " images=images,\n", + " suffix=labels, # type: ignore\n", + " return_tensors=\"pt\",\n", + " padding=\"longest\",\n", + " ) # type: ignore\n", + " tokens.to(torch.bfloat16).to(\"cuda\")\n", + " return tokens\n", + "\n", + " hparams = dict(\n", + " # dataset-related args\n", + " dataloader_pin_memory=False,\n", + " remove_unused_columns=False,\n", + " # train-related args\n", + " bf16=True,\n", + " num_train_epochs=2,\n", + " per_device_train_batch_size=1,\n", + " gradient_accumulation_steps=8,\n", + " # hyperparams\n", + " warmup_steps=2,\n", + " learning_rate=2e-5,\n", + " weight_decay=1e-6,\n", + " adam_beta2=0.999,\n", + " optim=\"adamw_torch\",\n", + " # reporting args\n", + " report_to=[\"tensorboard\"],\n", + " logging_steps=100,\n", + " # save weights\n", + " save_strategy=\"epoch\",\n", + " output_dir=fine_tuned_adapter.path,\n", + " )\n", + " hparams.update(config)\n", + "\n", + " training_args = TrainingArguments(**hparams)\n", + " trainer = Trainer(\n", + " model=model,\n", + " args=training_args,\n", + " train_dataset=ds[\"train\"], # type: ignore\n", + " data_collator=collate_fn,\n", + " )\n", + "\n", + " trainer.train()\n", + "\n", + " # Save the model\n", + " trainer.save_model(fine_tuned_adapter.path)\n", + "\n", + " ray.init()\n", + "\n", + " scaling_config = ScalingConfig(num_workers=4, use_gpu=True)\n", + "\n", + " trainer = TorchTrainer(\n", + " train_loop_per_worker=train_func,\n", + " 
train_loop_config=hparams,\n", + " scaling_config=scaling_config,\n", + " )\n", + "\n", + " results = trainer.fit()\n", + " print(results)" + ] + }, + { + "cell_type": "markdown", + "id": "c8f501f1-ef73-4571-b144-d9a5ed17da96", + "metadata": {}, + "source": [ + "### 4. Merge LoRA Adapters into Base Model" + ] + }, + { + "cell_type": "markdown", + "id": "00360d37-e8fe-4fef-9709-4585d5a4d4ec", + "metadata": {}, + "source": [ + "Optionally, one can decide whether to merge the fine-tuned LoRA adapter into the base model or not; meaning that this component expects both the base model and the fine-tuned LoRA adapter as the input and produces the merged adapter as the output, only if the boolean flag `merge_adapter` is set to True within the pipeline arguments." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b3c3007-e7af-4262-9947-ba80f93612a1", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from kfp.dsl import Input, Output, Model, component\n", + "\n", + "@component(base_image=os.getenv(\"CONTAINER_URI\"))\n", + "def merge_adapter_into_base_model(\n", + " base_model: Input[Model],\n", + " fine_tuned_adapter: Input[Model],\n", + " fine_tuned_model: Output[Model],\n", + ") -> None:\n", + " import torch\n", + " from peft import PeftModel\n", + " from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor\n", + "\n", + " processor = PaliGemmaProcessor.from_pretrained(base_model.path)\n", + " processor.save_pretrained(fine_tuned_model.path)\n", + "\n", + " model = PaliGemmaForConditionalGeneration.from_pretrained(\n", + " base_model.path,\n", + " torch_dtype=torch.bfloat16,\n", + " device_map=\"auto\",\n", + " )\n", + " model = PeftModel.from_pretrained(model, fine_tuned_adapter.path)\n", + " model = model.merge_and_unload()\n", + " model.save_pretrained(fine_tuned_model.path)" + ] + }, + { + "cell_type": "markdown", + "id": "b4365c90-ef6b-4ce3-bff0-2d197db85bbd", + "metadata": {}, + "source": [ + "### Vertex Pipeline Definition" + ] + }, + { + "cell_type": "markdown", + "id": "1e3b6a76-e998-4293-9669-37e15204ea5d", + "metadata": {}, + "source": [ + "Once the components are created, next is to define a function with the `kfp.dsl.pipeline` decorator that will define how the different steps are interconnected between each other and which compute requirements do each of those components need.\n", + "\n", + "As already mentioned, both the download of the PaliGemma 2 base model and the dataset are going to be the first steps, and those are required to succeed before the fine-tuning step is triggered; whilst finally providing a conditional step based on `merge_adapter` that merges the fine-tuned LoRA adapter into the base model if True." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aaf1e58f-8337-4ab0-a615-f339386485bc", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from kfp.dsl import If, pipeline\n", + "\n", + "@pipeline(\n", + " name=\"fine-tune-paligemma-2\",\n", + " pipeline_root=f\"gs://{os.getenv('BUCKET_NAME')}\",\n", + ")\n", + "def pipeline_fn(\n", + " dataset_id: str,\n", + " pretrained_model_name_or_path: str,\n", + " split: Optional[str] = None,\n", + " test_size: Optional[float] = 0.1,\n", + " project_id: Optional[str] = None,\n", + " secret_id: Optional[str] = None,\n", + " version_id: Optional[str] = \"latest\",\n", + " hparams: Optional[dict] = None,\n", + " merge_adapter: Optional[bool] = False,\n", + ") -> None:\n", + " download_dataset_from_hub_task = (\n", + " download_dataset_from_hub( # type: ignore\n", + " dataset_id=dataset_id,\n", + " split=split,\n", + " test_size=test_size,\n", + " )\n", + " .set_cpu_limit(\"4\")\n", + " .set_memory_limit(\"16G\")\n", + " )\n", + "\n", + " download_model_from_hub_task = (\n", + " download_model_from_hub( # type: ignore\n", + " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", + " project_id=project_id,\n", + " secret_id=secret_id,\n", + " version_id=version_id,\n", + " )\n", + " .set_cpu_limit(\"4\")\n", + " .set_memory_limit(\"16G\")\n", + " )\n", + "\n", + " fine_tune_model_task = (\n", + " fine_tune_model( # type: ignore\n", + " dataset=download_dataset_from_hub_task.outputs[\"dataset\"],\n", + " base_model=download_model_from_hub_task.outputs[\"base_model\"],\n", + " hparams=hparams,\n", + " )\n", + " .add_node_selector_constraint(\"NVIDIA_L4\")\n", + " .set_accelerator_limit(4)\n", + " )\n", + "\n", + " with If(merge_adapter == True, \"merge_adapter=True\"):\n", + " (\n", + " merge_adapter_into_base_model(\n", + " base_model=download_model_from_hub_task.outputs[\"base_model\"],\n", + " fine_tuned_adapter=fine_tune_model_task.outputs[\"fine_tuned_adapter\"],\n", + " )\n", + " .add_node_selector_constraint(\"NVIDIA_L4\")\n", + " .set_accelerator_limit(1)\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "dfa4686d-5f25-498a-a385-8b2903d2c939", + "metadata": {}, + "source": [ + "## Run Vertex Pipeline" + ] + }, + { + "cell_type": "markdown", + "id": "13dc62b4-1911-4c17-8ae5-274f0c28d438", + "metadata": {}, + "source": [ + "Finally, you can compile and run the pipeline on Vertex AI as a `PipelineJob`. To compile the Kubeflow Pipeline you are going to use the `kfp.compiler` which will shrink the `pipeline_fn` created above into a Kubeflow formatted YAML file that contains the pipeline specification, the code for the different steps as defined above, and how the different pipeline steps are interconnected; that when compiled will generate a YAML file that can be used within the `PipelineJob` mentioned above that will run on Vertex AI.\n", + "\n", + "> [!NOTE]\n", + "> The Kubeflow Compiler needs to be run once within the same pipeline so that the YAML file is created, but then you won't need to define all the steps / code again, but just to run the pipeline via the exported YAML, meaning that you can freely share the YAML for reproducible runs on any Kubeflow-compatible server (in this case being Vertex Pipelines). So on, the compilation needs to be done every time there are changes on the code of the pipeline, but is not required if there are no code changes." 
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2a2917b2-5018-405b-8ecb-29a086b043c7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from kfp import compiler\n",
+    "\n",
+    "compiler.Compiler().compile(\n",
+    "    pipeline_func=pipeline_fn,  # type: ignore\n",
+    "    package_path=\"pipeline.yaml\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bd9a953b-09ec-466c-b5d4-8441a7b9ddd1",
+   "metadata": {},
+   "source": [
+    "Once the pipeline is compiled, a `pipeline.yaml` file will be generated and the `PipelineJob` can be created and submitted so that it runs on Vertex AI.\n",
+    "\n",
+    "In this case, caching is enabled via `enable_caching=True`, meaning that each step will only run if it has not been run already i.e. once the dataset has been downloaded from the Hugging Face Hub, that step will be cached, so even if another step fails or the pipeline is re-run in the future, it will only run once. Caching is sometimes useful, especially when everything is prepared to be reproducible and pinned to a given revision, but sometimes you may want to re-run a given stage to e.g. pull the latest data, so be mindful about using caching as it may not always fit your use case.\n",
+    "\n",
+    "Besides that, you will be providing the following `parameter_values` (which you can tweak if needed, being mindful that they are conditioned by the code defined above):\n",
+    "\n",
+    "- `pretrained_model_name_or_path` is set to [`google/paligemma2-3b-pt-448`](https://huggingface.co/google/paligemma2-3b-pt-448) but could be set to any pre-trained PaliGemma or PaliGemma 2 model available on the Hub, at any resolution and size, taking into consideration that the specifications of the instance for the fine-tuning step should be modified accordingly i.e. a bigger model variant such as [`google/paligemma2-28b-pt-896`](https://huggingface.co/google/paligemma2-28b-pt-896) would also require a bigger instance with more, or more powerful, GPUs.\n",
+    "- `dataset_id` is set to [`merve/vqav2-small`](https://huggingface.co/datasets/merve/vqav2-small) (as per [the reference example from smol-vision](https://github.com/merveenoyan/smol-vision/blob/main/Fine_tune_PaliGemma.ipynb)) which is a smaller subset of [`HuggingFaceM4/VQAv2`](https://huggingface.co/datasets/HuggingFaceM4/VQAv2), but could be set to any dataset with the columns `multiple_choice_answer`, `question`, and `image`. Alternatively, you could modify the `collate_fn` function within the `fine_tune_model` function as previously explained to handle different inputs whilst always expecting a query, an image and a completion.\n",
+    "- `hparams` are the fine-tuning hyper-parameters that are provided to the fine-tuning function via the Ray Train `train_loop_config`, and they comply with the arguments defined in [`transformers.TrainingArguments`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments).\n",
+    "- `split` and `test_size` are set to `validation` and `0.3`, respectively, but those values will vary depending on the dataset you are using.\n",
+    "- `project_id`, `secret_id` and `version_id` are the values related to the Hugging Face Hub token that was previously stored as a secret in Google Cloud Secret Manager, and they are used to pull the token from Secret Manager.\n",
+    "- `merge_adapter` is a boolean flag that controls whether to merge the LoRA adapters into the base model or not. For more information check the [Transformers Tutorials for \"Load adapters with 🤗 PEFT\"](https://huggingface.co/docs/transformers/main/en/peft).\n",
+    "\n",
+    "Once all the arguments are defined, you can call the `submit` method, which will asynchronously trigger the pipeline execution on Vertex AI using the previously created Service Account with the required permissions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "12fe0053-cfc0-4e24-ad97-c39393399276",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from google.cloud.aiplatform import pipeline_jobs\n",
+    "\n",
+    "pipeline_job = pipeline_jobs.PipelineJob(\n",
+    "    display_name=\"fine-tune-paligemma-2\",\n",
+    "    template_path=\"pipeline.yaml\",\n",
+    "    enable_caching=True,  # set to False if you want to disable component caching\n",
+    "    project=os.getenv(\"PROJECT_ID\"),\n",
+    "    location=os.getenv(\"LOCATION\"),\n",
+    "    parameter_values={\n",
+    "        # model arguments\n",
+    "        \"pretrained_model_name_or_path\": \"google/paligemma2-3b-pt-448\",\n",
+    "        # dataset arguments\n",
+    "        \"dataset_id\": \"merve/vqav2-small\",\n",
+    "        \"split\": \"validation\",\n",
+    "        \"test_size\": 0.3,\n",
+    "        # fine-tuning hyper-params\n",
+    "        \"hparams\": {\"num_train_epochs\": 1},\n",
+    "        # for pulling secrets from the secret manager\n",
+    "        \"project_id\": os.getenv(\"PROJECT_ID\"),\n",
+    "        \"secret_id\": os.getenv(\"SECRET_ID\"),\n",
+    "        \"version_id\": \"latest\",\n",
+    "        # whether to merge the adapters into the base model or not\n",
+    "        \"merge_adapter\": True,\n",
+    "    },\n",
+    ")\n",
+    "\n",
+    "pipeline_job.submit(service_account=f\"{os.getenv('SERVICE_ACCOUNT_NAME')}@{os.getenv('PROJECT_ID')}.iam.gserviceaccount.com\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c0bd0818-69e4-4eae-9335-ff0484cbbf05",
+   "metadata": {},
+   "source": [
+    "Once submitted, if you navigate to Vertex AI on the Google Cloud Console, you will see the Vertex Pipeline as follows:\n",
+    "\n",
+    "![Vertex Pipeline triggered on Google Cloud](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch/vertex-pipeline.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3166dd69-0f87-4add-a078-6d6053d28459",
+   "metadata": {},
+   "source": [
+    "Finally, once the Vertex Pipeline has completed you will see that all the steps succeeded, and you can inspect the generated artifacts, as they come with the path to the Google Cloud Storage (GCS) bucket where they are stored. Both the dataset and the model artifacts can be used directly within Google Cloud, or pushed to the Hugging Face Hub too, as sketched below! 😉\n",
+    "\n",
+    "![Vertex Pipeline succeeded on Google Cloud](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch/vertex-pipeline-succeeded.png)"
+   ]
+  },
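+  {
+   "cell_type": "markdown",
+   "id": "9c4f2b1a-7e3d-4a8b-b6c2-0f5a1d2e3c4b",
+   "metadata": {},
+   "source": [
+    "For reference, below is a minimal sketch of how the merged model artifact could be pushed from GCS to the Hugging Face Hub using `huggingface_hub`. The GCS path and the `repo_id` used below are placeholders: replace them with the `fine_tuned_model` artifact URI shown in your pipeline run and with a Hub repository of your own, and make sure you are logged in to the Hub with a token that has write access."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5b8e7d6c-1a2f-4c3d-8e9f-4a5b6c7d8e9f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Minimal sketch (not part of the pipeline): push the merged model artifact to the Hugging Face Hub.\n",
+    "# The GCS path and the repository id below are placeholders; replace them with the\n",
+    "# `fine_tuned_model` artifact URI from your pipeline run and a repository you own.\n",
+    "!gsutil -m cp -r gs://<BUCKET_NAME>/<PATH_TO_FINE_TUNED_MODEL_ARTIFACT> ./fine-tuned-model\n",
+    "\n",
+    "from huggingface_hub import HfApi\n",
+    "\n",
+    "api = HfApi()  # assumes you are already logged in e.g. via `huggingface-cli login`\n",
+    "api.create_repo(repo_id=\"my-username/paligemma2-vqav2-ft\", private=True, exist_ok=True)\n",
+    "api.upload_folder(folder_path=\"./fine-tuned-model\", repo_id=\"my-username/paligemma2-vqav2-ft\")"
+   ]
+  },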
😉\n", + "\n", + "![Vertex Pipeline succeeded on Google Cloud](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/google-cloud/examples/vertex-ai/pipelines/fine-tune-paligemma-2-with-pytorch/vertex-pipeline-succeeded.png)" + ] + }, + { + "cell_type": "markdown", + "id": "f31d40fd-9914-495b-bf0d-989e53883d79", + "metadata": {}, + "source": [ + "## References" + ] + }, + { + "cell_type": "markdown", + "id": "d795b50f-e5cf-45de-815f-311c027be4e7", + "metadata": {}, + "source": [ + "For more information on Vertex AI Pipelines you can check:\n", + "\n", + "- [Vertex AI Documentation: Introduction to Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction)\n", + "- [Google Cloud Vertex AI Samples: Notebooks, code samples, sample apps, and other resources that demonstrate how to use, develop and manage machine learning and generative AI workflows using Google Cloud Vertex AI](https://github.com/GoogleCloudPlatform/vertex-ai-samples/tree/main/notebooks/official/pipelines)\n", + "\n", + "And for more information on PaliGemma 2 and VLM fine-tuning you can check:\n", + "\n", + "- [PaliGemma 2 Hugging Face Announcement: Welcome PaliGemma 2 – New vision language models by Google](https://huggingface.co/blog/paligemma2)\n", + "- [Google Developers Blog: Introducing PaliGemma 2: Powerful Vision-Language Models, Simple Fine-Tuning](https://developers.googleblog.com/en/introducing-paligemma-2-powerful-vision-language-models-simple-fine-tuning)\n", + "- [Merve Noyan's smol-vision: Recipes for shrinking, optimizing, customizing cutting edge vision models](https://github.com/merveenoyan/smol-vision)\n", + "- [Aritra Roy Gosthipaty's Notebooks for Fine-tuning PaliGemma](https://github.com/ariG23498/fine-tune-paligemma)\n", + "- [Hugging Face's cookbook: Open-source AI cookbook](https://github.com/huggingface/cookbook)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.14" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}