Added an example Notebook to fine-tune Llama3 model using PyTorchJob #2419
Conversation
andreyvelich left a comment:
Thank you for this effort @aishwaryaraimule21!
I am fine with merging this KFP example.
Any thoughts @johnugeorge @tenzen-y @Electronic-Waste @astefanutti ?
| " )\n", | ||
| " \n", | ||
| " # check the status of the job\n", | ||
| " from kubeflow.pytorchjob import PyTorchJobClient\n", |
Should you use TrainingClient here?
Updated the PR. Now using TrainingClient().get_job_conditions() to fetch the job status.
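For reference, a minimal sketch of what that status check looks like with the Training SDK's TrainingClient (the job name and namespace are placeholders, not necessarily the notebook's exact values):

```python
from kubeflow.training import TrainingClient

client = TrainingClient()

# get_job_conditions() returns the PyTorchJob's condition list
# (Created, Running, Succeeded, Failed), which the component can poll.
conditions = client.get_job_conditions(
    name="llama-3-1-8b-kubecon",  # placeholder: job name used in this example
    namespace="kubeflow",         # placeholder namespace
    job_kind="PyTorchJob",
)
print(conditions)
```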
I have no objections :)
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
In that case, what is the relationship with the training examples in the KFP repository, something like https://github.com/kubeflow/pipelines/tree/472f8779ded18f8904c5cbe15c0573d461d57af5/components/kubeflow/pytorch-launcher?
I think you can use the PyTorch launcher, or you can directly use the TrainingClient from the Training SDK.
SGTM.
@aishwaryaraimule21 Can you sign the DCO, please?
Force-pushed from 891bb0c to d62081a.
Signed-off-by: aishwarya.raimule <[email protected]>
Force-pushed from d62081a to b65cfbf.
@andreyvelich I have signed the DCO. Please check. Thanks.
| "\n", | ||
| "In this component, use TrainingClient() to create PyTorchJob which will fine-tune Llama3 model on 1 worker with 1 GPU.\n", | ||
| "\n", | ||
| "Specify the required packages in the *dsl.component* decorator. We would need kubeflow-pytorchjob, kubeflow-training[huggingface] and numpy packages in this Kubeflow component.\n", |
Is kubeflow-pytorchjob really necessary since TrainingClient is used now?
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "@dsl.component(packages_to_install=['kubeflow-pytorchjob', 'kubeflow-training[huggingface]','numpy<1.24'])\n", |
Ditto, is kubeflow-pytorchjob really necessary since TrainingClient is used now?
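If it is not, the decorator presumably reduces to something like this sketch (the component name is an assumption, and only the Training SDK and numpy are assumed to be needed):

```python
from kfp import dsl

# kubeflow-pytorchjob dropped: TrainingClient ships with kubeflow-training.
@dsl.component(packages_to_install=['kubeflow-training[huggingface]', 'numpy<1.24'])
def finetune_model():
    # Component body: create the PyTorchJob with TrainingClient().train(...)
    # and poll its conditions, as discussed above.
    pass
```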
| " ),\n", | ||
| " # it is assumed for text related tasks, you have 'text' column in the dataset.\n", | ||
| " # for more info on how dataset is loaded check load_and_preprocess_data function in sdk/python/kubeflow/trainer/hf_llm_training.py\n", | ||
| " dataset_provider_parameters=HuggingFaceDatasetParams(repo_id=\"aishwaryayyy/events_data\"),\n", |
It would be better to remove dependencies on a user-specific repository.
Replaced the user-specific repository with https://huggingface.co/datasets/Yelp/yelp_review_full.
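A sketch of the updated parameter, assuming the SDK's HuggingFaceDatasetParams from kubeflow.storage_initializer.hugging_face:

```python
from kubeflow.storage_initializer.hugging_face import HuggingFaceDatasetParams

# Public dataset, so the example no longer depends on a user-specific repo.
dataset_provider_parameters = HuggingFaceDatasetParams(
    repo_id="Yelp/yelp_review_full",  # https://huggingface.co/datasets/Yelp/yelp_review_full
)
```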
| " name=\"llama-3-1-8b-kubecon\",\n", | ||
| " num_workers=1,\n", | ||
| " num_procs_per_worker=1,\n", | ||
| " # specify the storage class if you don't want to use the default one for the storage-initializer PVC\n", |
It would be useful to mention that a provisioner capable of provisioning an RWX PVC is needed when distributing the training across multiple nodes/workers.
done
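In practice the note amounts to something like this sketch (the "nfs-storage" class comes from the example; the size is an illustrative assumption):

```python
# With num_workers > 1, all worker pods mount the same storage-initializer
# PVC, so the storage class must support the ReadWriteMany access mode
# (e.g. an NFS-backed provisioner).
storage_config = {
    "size": "70Gi",                  # assumed size for an 8B model + dataset
    "storage_class": "nfs-storage",  # RWX-capable class used in this example
}
```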
| " \"storage_class\": \"nfs-storage\",\n", | ||
| " },\n", | ||
| " model_provider_parameters=HuggingFaceModelParams(\n", | ||
| " model_uri=\"hf://meta-llama/Llama-3.1-8B-Instruct\",\n", |
Should we cover the distributed training case, and provide the configuration so the model does not get downloaded on each local node / worker?
Done, covered the distributed training case using 2 workers.
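Putting the review points together, the final call plausibly looks like this sketch (the access token, PVC size, and trainer/LoRA settings are illustrative assumptions, not the notebook's exact values):

```python
import transformers
from peft import LoraConfig
from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

TrainingClient().train(
    name="llama-3-1-8b-kubecon",
    num_workers=2,               # distributed fine-tuning across 2 workers
    num_procs_per_worker=1,      # 1 GPU per worker
    resources_per_worker={"gpu": 1},
    # One shared RWX PVC: the storage-initializer downloads the model and
    # dataset once instead of on every worker.
    storage_config={"size": "70Gi", "storage_class": "nfs-storage"},
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://meta-llama/Llama-3.1-8B-Instruct",
        transformer_type=transformers.AutoModelForCausalLM,
        access_token="<HF_TOKEN>",  # assumed: Llama weights are gated on HF
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="Yelp/yelp_review_full",
    ),
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="/mnt/output",       # illustrative values below
            num_train_epochs=1,
            per_device_train_batch_size=1,
        ),
        lora_config=LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1),
    ),
)
```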
Pull Request Test Coverage Report for Build 13375853453
💛 - Coveralls
Hi @aishwaryaraimule21, did you get a chance to address @astefanutti's feedback, so we can merge this example to the release-1.9 branch?
Signed-off-by: aishwarya.raimule <[email protected]>
… output paths Signed-off-by: aishwarya.raimule <[email protected]>
Signed-off-by: aishwarya.raimule <[email protected]>
…fine-tuned model Signed-off-by: aishwarya.raimule <[email protected]>
Force-pushed from 2bd4910 to 6a2e749.
…orkers Signed-off-by: aishwarya.raimule <[email protected]>
…inetune_model component Signed-off-by: aishwarya.raimule <[email protected]>
@andreyvelich I have tested the distributed training workflow using an older trainer image of […].
Signed-off-by: aishwarya.raimule <[email protected]>
Do you want to try updating the other packages and try again, @aishwaryaraimule21?
Yes, @andreyvelich. Let me try updating the other packages. I tried running this example with smaller models like SmolLM2-135M-Instruct; the training succeeds, but the subsequent steps fail. I haven't gotten a chance to debug this yet. I will try to fix it this week. Thanks!
What this PR does / why we need it:
This PR adds a distributed training example where a Llama model is fine-tuned on the Yelp dataset using a Kubeflow Pipeline.
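For orientation, the overall shape of such a pipeline might look like the sketch below (the component and pipeline names are assumptions for illustration, not the notebook's exact code):

```python
from kfp import dsl, compiler

@dsl.component(packages_to_install=['kubeflow-training[huggingface]'])
def finetune_model():
    # Training component: creates the PyTorchJob via TrainingClient().train(...)
    # and polls its conditions, as sketched earlier in the thread.
    pass

@dsl.pipeline(name="llama3-finetune-pipeline")  # assumed pipeline name
def llama3_pipeline():
    finetune_model()

# Compile to a YAML package that can be uploaded to a KFP instance.
compiler.Compiler().compile(llama3_pipeline, "llama3_pipeline.yaml")
```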