Added an example Notebook to fine-tune Llama3 model using PyTorchJob #2419
Conversation
andreyvelich left a comment:
Thank you for this effort @aishwaryaraimule21!
I am fine with merging this KFP example.
Any thoughts @johnugeorge @tenzen-y @Electronic-Waste @astefanutti ?
| " )\n", | ||
| " \n", | ||
| " # check the status of the job\n", | ||
| " from kubeflow.pytorchjob import PyTorchJobClient\n", |
Should you use TrainingClient here?
Updated the PR. Now using TrainingClient().get_job_conditions() to fetch the job status.
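For reference, a minimal sketch of what that status check looks like with the Training SDK's TrainingClient (the job name and namespace are placeholders, not necessarily the notebook's exact values):

```python
from kubeflow.training import TrainingClient

client = TrainingClient()

# get_job_conditions() returns the PyTorchJob's condition list
# (Created, Running, Succeeded, Failed), which the component can poll.
conditions = client.get_job_conditions(
    name="llama-3-1-8b-kubecon",  # placeholder: job name used in this example
    namespace="kubeflow",         # placeholder namespace
    job_kind="PyTorchJob",
)
print(conditions)
```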
I have no objections :)
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
In that case, what is the relationship with the training examples in the KFP repository, something like https://github.com/kubeflow/pipelines/tree/472f8779ded18f8904c5cbe15c0573d461d57af5/components/kubeflow/pytorch-launcher?
I think you can use the PyTorch launcher, or you can directly use the TrainingClient from the Training SDK.
SGTM.
@aishwaryaraimule21 Can you sign the DCO, please?
Force-pushed from 891bb0c to d62081a.
Signed-off-by: aishwarya.raimule <[email protected]>
Force-pushed from d62081a to b65cfbf.
@andreyvelich I have signed the DCO. Please check. Thanks.
| "\n", | ||
| "In this component, use TrainingClient() to create PyTorchJob which will fine-tune Llama3 model on 1 worker with 1 GPU.\n", | ||
| "\n", | ||
| "Specify the required packages in the *dsl.component* decorator. We would need kubeflow-pytorchjob, kubeflow-training[huggingface] and numpy packages in this Kubeflow component.\n", |
Is kubeflow-pytorchjob really necessary since TrainingClient is used now?
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "@dsl.component(packages_to_install=['kubeflow-pytorchjob', 'kubeflow-training[huggingface]','numpy<1.24'])\n", |
Ditto, is kubeflow-pytorchjob really necessary since TrainingClient is used now?
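If it is not, the decorator presumably reduces to something like this sketch (the component name is an assumption, and only the Training SDK and numpy are assumed to be needed):

```python
from kfp import dsl

# kubeflow-pytorchjob dropped: TrainingClient ships with kubeflow-training.
@dsl.component(packages_to_install=['kubeflow-training[huggingface]', 'numpy<1.24'])
def finetune_model():
    # Component body: create the PyTorchJob with TrainingClient().train(...)
    # and poll its conditions, as discussed above.
    pass
```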
| " ),\n", | ||
| " # it is assumed for text related tasks, you have 'text' column in the dataset.\n", | ||
| " # for more info on how dataset is loaded check load_and_preprocess_data function in sdk/python/kubeflow/trainer/hf_llm_training.py\n", | ||
| " dataset_provider_parameters=HuggingFaceDatasetParams(repo_id=\"aishwaryayyy/events_data\"),\n", |
It would be better to remove dependencies on a user-specific repository.
Replaced the user-specific repository with https://huggingface.co/datasets/Yelp/yelp_review_full.
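A sketch of the updated parameter, assuming the SDK's HuggingFaceDatasetParams from kubeflow.storage_initializer.hugging_face:

```python
from kubeflow.storage_initializer.hugging_face import HuggingFaceDatasetParams

# Public dataset, so the example no longer depends on a user-specific repo.
dataset_provider_parameters = HuggingFaceDatasetParams(
    repo_id="Yelp/yelp_review_full",  # https://huggingface.co/datasets/Yelp/yelp_review_full
)
```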
| " name=\"llama-3-1-8b-kubecon\",\n", | ||
| " num_workers=1,\n", | ||
| " num_procs_per_worker=1,\n", | ||
| " # specify the storage class if you don't want to use the default one for the storage-initializer PVC\n", |
It would be useful to mention that a provisioner capable of provisioning an RWX PVC is needed when distributing the training across multiple nodes/workers.
done
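In practice the note amounts to something like this sketch (the "nfs-storage" class comes from the example; the size is an illustrative assumption):

```python
# With num_workers > 1, all worker pods mount the same storage-initializer
# PVC, so the storage class must support the ReadWriteMany access mode
# (e.g. an NFS-backed provisioner).
storage_config = {
    "size": "70Gi",                  # assumed size for an 8B model + dataset
    "storage_class": "nfs-storage",  # RWX-capable class used in this example
}
```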
| " \"storage_class\": \"nfs-storage\",\n", | ||
| " },\n", | ||
| " model_provider_parameters=HuggingFaceModelParams(\n", | ||
| " model_uri=\"hf://meta-llama/Llama-3.1-8B-Instruct\",\n", |
Should we cover the distributed training case, and provide the configuration so the model does not get downloaded on each local node / worker?
Done, covered the distributed training case using 2 workers.
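Putting the review points together, the final call plausibly looks like this sketch (the access token, PVC size, and trainer/LoRA settings are illustrative assumptions, not the notebook's exact values):

```python
import transformers
from peft import LoraConfig
from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

TrainingClient().train(
    name="llama-3-1-8b-kubecon",
    num_workers=2,               # distributed fine-tuning across 2 workers
    num_procs_per_worker=1,      # 1 GPU per worker
    resources_per_worker={"gpu": 1},
    # One shared RWX PVC: the storage-initializer downloads the model and
    # dataset once instead of on every worker.
    storage_config={"size": "70Gi", "storage_class": "nfs-storage"},
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://meta-llama/Llama-3.1-8B-Instruct",
        transformer_type=transformers.AutoModelForCausalLM,
        access_token="<HF_TOKEN>",  # assumed: Llama weights are gated on HF
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="Yelp/yelp_review_full",
    ),
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="/mnt/output",       # illustrative values below
            num_train_epochs=1,
            per_device_train_batch_size=1,
        ),
        lora_config=LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1),
    ),
)
```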
Pull Request Test Coverage Report for Build 13375853453
💛 - Coveralls
Hi @aishwaryaraimule21, did you get a chance to address @astefanutti's feedback, so we can merge this example to the release-1.9 branch?
Signed-off-by: aishwarya.raimule <[email protected]>
… output paths Signed-off-by: aishwarya.raimule <[email protected]>
Signed-off-by: aishwarya.raimule <[email protected]>
…fine-tuned model Signed-off-by: aishwarya.raimule <[email protected]>
Force-pushed from 2bd4910 to 6a2e749.
…orkers Signed-off-by: aishwarya.raimule <[email protected]>
…inetune_model component Signed-off-by: aishwarya.raimule <[email protected]>
@andreyvelich I have tested the distributed training workflow using an older trainer image of […].
Signed-off-by: aishwarya.raimule <[email protected]>
Do you want to try updating the other packages and try again, @aishwaryaraimule21?
Yes, @andreyvelich. Let me try updating the other packages. I tried running this example with smaller models like SmolLM2-135M-Instruct; the training succeeds, but the subsequent steps fail. I haven't gotten a chance to debug this yet. I will try to fix it this week. Thanks!
What this PR does / why we need it:
This PR adds a distributed training example where a Llama model is fine-tuned on the Yelp dataset using a Kubeflow Pipeline.
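For orientation, the overall shape of such a pipeline might look like the sketch below (the component and pipeline names are assumptions for illustration, not the notebook's exact code):

```python
from kfp import dsl, compiler

@dsl.component(packages_to_install=['kubeflow-training[huggingface]'])
def finetune_model():
    # Training component: creates the PyTorchJob via TrainingClient().train(...)
    # and polls its conditions, as sketched earlier in the thread.
    pass

@dsl.pipeline(name="llama3-finetune-pipeline")  # assumed pipeline name
def llama3_pipeline():
    finetune_model()

# Compile to a YAML package that can be uploaded to a KFP instance.
compiler.Compiler().compile(llama3_pipeline, "llama3_pipeline.yaml")
```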