
Commit d0aae5b

Merge remote-tracking branch 'origin/main' into project/addsimplefinetune

2 parents: ab22dc7 + a0eee84

35 files changed: +5244 -26 lines

.gitignore

Lines changed: 14 additions & 2 deletions

```diff
@@ -150,6 +150,18 @@ zencoder/cloned_public_repos
 llm-lora-finetuning/ckpt/
 llm-lora-finetuning/data_generation/
 llm-lora-finetuning/datagen/
-nohup.out
+fiftyone-ls-demo/
+llm-lora-finetuning/mistral-zenml-finetune/
 .flashrank_cache
-
+bge-base-financial-matryoshka/
+embeddings
+llm-lora-finetuning/meta-llama/
+llm-lora-finetuning/microsoft/
+llm-lora-finetuning/unsloth/
+llm-lora-finetuning/configs/shopify.yaml
+finetuned-matryoshka/
+finetuned-all-MiniLM-L6-v2/
+finetuned-snowflake-arctic-embed-m/
+
+# ollama ignores
+nohup.out
```

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -73,7 +73,7 @@ A list of updated and maintained projects by the ZenML team and the community:
 | [LLM RAG Pipeline with Langchain and OpenAI](llm-agents/) | NLP, LLMs | `slack` `langchain` `llama_index` |
 | [Orbit User Analysis](orbit-user-analysis) | Data Analysis, Tabular | - |
 | [Huggingface to Sagemaker](huggingface-sagemaker) | NLP | `pytorch` `mlflow` `huggingface` `aws` `s3` `kubeflow` `slack` `github` |
-| [Complete Guide to LLMs (from RAG to finetuning)](llm-complete-guide) | NLP, LLMs | `openai` `supabase` |
+| [Complete Guide to LLMs (from RAG to finetuning)](llm-complete-guide) | NLP, LLMs, embeddings, finetuning | `openai` `supabase` `huggingface` `argilla` |
 | [LLM LoRA Finetuning (Phi3 and Llama 3.1)](llm-lora-finetuning) | NLP, LLMs | `gcp` |
 | [ECP Price Prediction with GCP Cloud Composer](airflow-cloud-composer-etl-feature-train/README.md) | Regression, Airflow | `cloud-composer` `airflow` |
 | [Simple LLM finetuning with Lightning Studio](simple-llm-finetuning/README.md) | Lightning AI Studio, LLMs | `cloud-composer` `airflow` |
```

llm-complete-guide/README.md

Lines changed: 47 additions & 1 deletion

```diff
@@ -116,7 +116,7 @@ Note that Claude will require a different API key from Anthropic. See [the
 `litellm` docs](https://docs.litellm.ai/docs/providers/anthropic) on how to set
 this up.
 
-### Run the evaluation pipeline
+### Run the LLM RAG evaluation pipeline
 
 To run the evaluation pipeline, you can use the following command:
 
@@ -127,6 +127,52 @@ python run.py --evaluation
 You'll need to have first run the RAG pipeline to have the necessary assets in
 the database to evaluate.
 
+## Embeddings finetuning
+
+For embeddings finetuning we first generate synthetic data and then finetune the
+embeddings. Both of these pipelines are described in [the LLMOps guide](https://docs.zenml.io/v/docs/user-guide/llmops-guide/finetuning-embeddings) and
+instructions for how to run them are provided below.
+
+### Run the `distilabel` synthetic data generation pipeline
+
+To run the `distilabel` synthetic data generation pipeline, you can use the following commands:
+
+```shell
+pip install -r requirements-argilla.txt # special requirements
+python run.py --synthetic
+```
+
+You will also need to have set up and connected to an Argilla instance for this
+to work. Please follow the instructions in the [Argilla
+documentation](https://docs.argilla.io/latest/getting_started/quickstart/)
+to set up and connect to an Argilla instance on the Hugging Face Hub. [ZenML's
+Argilla integration
+documentation](https://docs.zenml.io/v/docs/stack-components/annotators/argilla)
+will guide you through the process of connecting to your instance as a stack
+component.
+
+### Finetune the embeddings
+
+To run the pipeline for finetuning the embeddings, you can use the following
+commands:
+
+```shell
+pip install -r requirements-argilla.txt # special requirements
+python run.py --embeddings
+```
+
+As with the previous pipeline, you will need to have set up and connected to an Argilla instance for this
+to work. Please follow the instructions in the [Argilla
+documentation](https://docs.argilla.io/latest/getting_started/quickstart/)
+to set up and connect to an Argilla instance on the Hugging Face Hub. [ZenML's
+Argilla integration
+documentation](https://docs.zenml.io/v/docs/stack-components/annotators/argilla)
+will guide you through the process of connecting to your instance as a stack
+component.
+
+*Credit to Phil Schmid for his [tutorial on embeddings finetuning with Matryoshka
+loss function](https://www.philschmid.de/fine-tune-embedding-model-for-rag) which we adapted for this project.*
+
 ## ☁️ Running in your own VPC
 
 The basic RAG pipeline will run using a local stack, but if you want to improve
```
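
For orientation, the `--synthetic` flag above runs a distilabel pipeline that turns documentation chunks into synthetic queries for embeddings training. Below is a minimal sketch of what such a pipeline can look like with the distilabel 1.x API; it is illustrative only, and the dataset id, column mapping, and step wiring are assumptions rather than the project's actual code.

```python
# Illustrative sketch of a distilabel synthetic-query pipeline (distilabel 1.x).
# NOTE: not the repository's actual --synthetic pipeline; the repo_id and the
# "page_content" column name are assumptions.
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import GenerateSentencePair

with Pipeline(name="generate-embedding-queries") as pipeline:
    # Load documentation chunks; GenerateSentencePair reads an "anchor" column.
    load_chunks = LoadDataFromHub(
        repo_id="zenml/rag_qa_embedding_questions_0_60_0",  # assumed dataset
        output_mappings={"page_content": "anchor"},  # assumed source column
    )
    # For each chunk, generate a query that the chunk answers, plus a hard
    # negative (triplet=True) for contrastive finetuning.
    generate_queries = GenerateSentencePair(
        triplet=True,
        action="query",
        llm=OpenAILLM(model="gpt-4o"),
    )
    load_chunks >> generate_queries

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    print(distiset)
```

With `triplet=True` the task emits (anchor, positive, negative) rows, which is the shape typically consumed by contrastive embeddings finetuning and by an optional Argilla annotation pass.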

llm-complete-guide/__init__.py

Whitespace-only changes.

llm-complete-guide/constants.py

Lines changed: 40 additions & 1 deletion

```diff
@@ -15,7 +15,6 @@
 # limitations under the License.
 #
 
-
 # Vector Store constants
 CHUNK_SIZE = 2000
 CHUNK_OVERLAP = 50
@@ -35,3 +34,43 @@
     "claude3": "claude-3-opus-20240229",
     "claudehaiku": "claude-3-haiku-20240307",
 }
+
+# CHUNKING_METHOD = "split-by-document"
+CHUNKING_METHOD = "split-by-header"
+DATASET_NAME = f"zenml/rag_qa_embedding_questions_{CHUNKING_METHOD}"
+MODEL_PATH = "all-MiniLM-L6-v2"
+# MODEL_PATH = "embedding-data/distilroberta-base-sentence-transformer"
+NUM_EPOCHS = 30
+WARMUP_STEPS = 0.1  # 10% of train data
+NUM_GENERATIONS = 2
+EVAL_BATCH_SIZE = 64
+
+DUMMY_DATASET_NAME = "embedding-data/sentence-compression"
+# DUMMY_MODEL_PATH = "embedding-data/distilroberta-base-sentence-transformer"
+DUMMY_MODEL_PATH = "all-MiniLM-L6-v2"
+DUMMY_EPOCHS = 10
+
+# Markdown Loader constants
+FILES_TO_IGNORE = [
+    "toc.md",
+]
+
+# embeddings finetuning constants
+EMBEDDINGS_MODEL_NAME_ZENML = "finetuned-zenml-docs-embeddings"
+DATASET_NAME_DEFAULT = "zenml/rag_qa_embedding_questions_0_60_0"
+DATASET_NAME_DISTILABEL = f"{DATASET_NAME_DEFAULT}_distilabel"
+DATASET_NAME_ARGILLA = DATASET_NAME_DEFAULT.replace("zenml/", "")
+OPENAI_MODEL_GEN = "gpt-4o"
+OPENAI_MODEL_GEN_KWARGS_EMBEDDINGS = {
+    "temperature": 0.7,
+    "max_new_tokens": 512,
+}
+EMBEDDINGS_MODEL_ID_BASELINE = "Snowflake/snowflake-arctic-embed-m"
+EMBEDDINGS_MODEL_ID_FINE_TUNED = "finetuned-snowflake-arctic-embed-m"
+EMBEDDINGS_MODEL_MATRYOSHKA_DIMS: list[int] = [
+    384,
+    256,
+    128,
+    64,
+]  # Important: large to small
+USE_ARGILLA_ANNOTATIONS = False
```
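
The `EMBEDDINGS_MODEL_MATRYOSHKA_DIMS` list above only makes sense alongside a Matryoshka-style training loss, which is also why the diff flags its large-to-small ordering as important. Below is a minimal sketch, not the project's actual training step, of how such constants typically feed `sentence-transformers`' `MatryoshkaLoss`; the training pairs shown are hypothetical.

```python
# Illustrative sketch: wiring the constants above into MatryoshkaLoss.
# NOTE: not the project's actual training step; the example pairs are made up.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

EMBEDDINGS_MODEL_ID_BASELINE = "Snowflake/snowflake-arctic-embed-m"
EMBEDDINGS_MODEL_MATRYOSHKA_DIMS = [384, 256, 128, 64]  # large to small

model = SentenceTransformer(EMBEDDINGS_MODEL_ID_BASELINE)

# Hypothetical (question, matching-chunk) pairs; a real run would pull these
# from the synthetic dataset produced by the distilabel pipeline.
train_examples = [
    InputExample(texts=["How do I run a ZenML pipeline?", "Pipelines are run with..."]),
    InputExample(texts=["What is a stack component?", "A stack component is..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MatryoshkaLoss applies the wrapped base loss to the embedding truncated to
# each listed dimension, so the model stays usable at 64 dims as well as 384.
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(
    model, base_loss, matryoshka_dims=EMBEDDINGS_MODEL_MATRYOSHKA_DIMS
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```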

llm-complete-guide/data/test_dataset.json

Lines changed: 166 additions & 0 deletions
Large diffs are not rendered by default.
