Commit 72134d5

chore: benchmark

1 parent 5799fed

14 files changed (+416978, -8 lines)

README.md

Lines changed: 5 additions & 2 deletions
@@ -94,7 +94,8 @@ Returns: A Hugging Face Dataset or DatasetDict object containing the processed p
 
 **Load Dataset from data dict**
 ```python
-prompt_instance = Template()
+from gemma_template import gemma_template
+
 data_dict = [
     {
         "id": "JnZJolR76_u2",
@@ -107,12 +108,14 @@ data_dict = [
         "main_points": ["Main point 1", "Main point 2"],
     }
 ]
-dataset = prompt_instance.load_dataset(data_dict, output_format='text') # enum: `text`, `alpaca` and `openai`.
+dataset = gemma_template.load_dataset(data_dict, output_format='text') # enum: `text`, `alpaca` and `openai`.
 print(dataset['text'][0])
 ```
 
 **Load Dataset from HuggingFace**
 ```python
+from gemma_template import gemma_template
+
 dataset = gemma_template.load_dataset(
     "YOUR_JSON_FILE_PATH_OR_HUGGINGFACE_DATASET",
     # enum: `text`, `alpaca` and `openai`.
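
For context, the snippet below is a runnable sketch of the usage this diff introduces. Only `id` and `main_points` appear in the diff; the other record fields are placeholders borrowed from the article elements listed in docs/benchmark.md, and are assumptions rather than the package's confirmed schema.

```python
# Minimal sketch of the updated README usage. Only "id" and "main_points"
# appear in the diff; every other field below is a placeholder guess based
# on the article elements named in docs/benchmark.md.
from gemma_template import gemma_template

data_dict = [
    {
        "id": "JnZJolR76_u2",
        "title": "Example title",                 # placeholder
        "description": "Example description",     # placeholder
        "categories": ["Example category"],       # placeholder
        "tags": ["example", "tag"],               # placeholder
        "main_points": ["Main point 1", "Main point 2"],
    }
]

# output_format accepts `text`, `alpaca` or `openai` (per the README comment).
dataset = gemma_template.load_dataset(data_dict, output_format="text")
print(dataset["text"][0])
```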

data/test.json

Lines changed: 32697 additions & 0 deletions
Large diffs are not rendered by default.

data/train.json

Lines changed: 323455 additions & 0 deletions
Large diffs are not rendered by default.

docs/benchmark.md

Lines changed: 75 additions & 0 deletions
# Benchmark

## Goal

An article typically consists of the following elements: **title**, **description**, **main points**, **categories**, and **tags**.
This fine-tuning is designed to address all of these elements in a single request using a **custom structured format**. Additional objectives include:

1. Enhancing the model's ability to process and respond effectively in the **local language**.
2. Improving the structured output format for better usability.
### Gemma 2B - Evaluation Results

The performance of the **Gemma 2B model** was assessed using **ROUGE** and **Google BLEU** metrics.

| ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-LSum | Google BLEU |
|---------|:-------:|:-------:|:----------:|:-----------:|
| 0.722   | 0.524   | 0.456   |   0.703    |    0.345    |

**Key Observations**:

* The model shows significant improvements in:
    * Handling responses in the user's language.
    * Structured content generation.

* **Challenges**:
    * Incomplete responses for certain articles.
    * Occasional duplication of keywords in responses.

For more details, see [**Version 1**](https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216121364) of this notebook. You can also download the file `gemma-benchmark/gemma_2b_eval_benchmark.json`, which is attached to that notebook.

Due to Kaggle limitations, I am currently unable to run **ROUGE** and **Google BLEU** evaluations on the refined **Gemma 2B IT** model ([**Version 2**](https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216252050)).
A demo of how I evaluate the dataset, along with the source code, is shown at the end of the notebook.
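As an illustration, here is a minimal sketch of how such an evaluation can be run with the Hugging Face `evaluate` library. This is not the notebook's exact code; the predictions and references are placeholders.

```python
# Illustrative ROUGE / Google BLEU evaluation using Hugging Face `evaluate`;
# a sketch, not the notebook's actual evaluation code.
import evaluate

rouge = evaluate.load("rouge")
google_bleu = evaluate.load("google_bleu")

predictions = ["Generated summary of the article."]  # placeholder model outputs
references = ["Reference summary of the article."]   # placeholder ground truth

# ROUGE returns rouge1, rouge2, rougeL and rougeLsum F-measures.
rouge_scores = rouge.compute(predictions=predictions, references=references)

# Google BLEU expects one list of reference strings per prediction.
bleu_scores = google_bleu.compute(
    predictions=predictions,
    references=[[ref] for ref in references],
)

print({**rouge_scores, **bleu_scores})
```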
* **Kaggle Gemma 2B Model:**
    * Model: [https://www.kaggle.com/models/bigfishdev/gemma-template-gemma-2b](https://www.kaggle.com/models/bigfishdev/gemma-template-gemma-2b)
    * Notebook Version 1: [https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216121364](https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216121364)
* **Kaggle Gemma 2B IT Model:**
    * Model: [https://www.kaggle.com/models/bigfishdev/gemma-template-gemma-2b-it](https://www.kaggle.com/models/bigfishdev/gemma-template-gemma-2b-it)
    * Notebook Version 2: [https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216252050](https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216252050)
* **Dataset:** [https://www.kaggle.com/datasets/bigfishdev/gemma-template](https://www.kaggle.com/datasets/bigfishdev/gemma-template)
* **Benchmark:** All benchmarks will be updated in my GitHub repo: [https://github.com/thewebscraping/gemma-template/blob/main/docs/benchmark.md](https://github.com/thewebscraping/gemma-template/blob/main/docs/benchmark.md)

### Gemma 2B - Vietnamese VMLU Evaluation Results
VMLU is a benchmark suite designed to evaluate foundation models with a focus on the **Vietnamese language**.

| ID                  |    Created At    | STEM  | Social Science | Humanities | Others |  AVG  | Unanswered |
|---------------------|:----------------:|:-----:|:--------------:|:----------:|:------:|:-----:|:----------:|
| 1624257089558187281 | 05/01/2025 17:56 | 20.14 |     29.35      |   29.84    | 25.76  | 25.61 |    1497    |

#### Results:
* Out of 9,834 attempts, 1,497 responses were unanswered.
* The dataset and evaluation results can be downloaded from the file `gemma-benchmark/gemma_2b_vmlu_answers.csv`, although they are outside the scope of this fine-tuning.

### Gemma 2B IT - Vietnamese VMLU Evaluation Results

| ID                  |    Created At    | STEM  | Social Science | Humanities | Others |  AVG  | Unanswered |
|---------------------|:----------------:|:-----:|:--------------:|:----------:|:------:|:-----:|:----------:|
| 1840435368978448913 | 06/01/2025 19:04 | 36.11 |     43.45      |   41.92    | 39.06  | 39.64 |     82     |

#### Results:
* Out of 9,834 attempts, 82 responses were unanswered.
* The dataset and evaluation results can be downloaded from the file `gemma-benchmark/gemma_2b_it_vmlu_benchmark.csv`, although they are outside the scope of this fine-tuning.
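For reference, here is a small sketch of how the unanswered count above could be tallied from one of these CSV files. The `answer` column name is hypothetical, since the files' exact schema is not shown here.

```python
# Hypothetical sketch: tally unanswered VMLU responses from the attached CSV.
# The "answer" column name is an assumption; the real file may differ.
import pandas as pd

df = pd.read_csv("gemma-benchmark/gemma_2b_it_vmlu_benchmark.csv")

total = len(df)
unanswered = df["answer"].isna().sum()  # assumed: empty cell means no answer
print(f"{unanswered} of {total} responses were unanswered")
```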

#### My Gemma Fine-Tuning VMLU Score:

![Screenshot VMLU_Gemma_Fine_Tuning.png](images/Screenshot_VMLU_Gemma_Fine_Tuning.png)

#### VMLU Leaderboard Score:
The Gemma 2B IT fine-tune stands out clearly in the VMLU rankings: its score is close to that of the **Gemma 7B IT** model. Here is a screenshot of the **VMLU Leaderboard** rankings:

![Screenshot VMLU_Leaderboard.png](images/Screenshot_VMLU_Leaderboard.png)

#### Additional Resources:
* VMLU Website: [https://vmlu.ai/](https://vmlu.ai/)
* VMLU Leaderboard: [https://vmlu.ai/leaderboard](https://vmlu.ai/leaderboard)
* VMLU GitHub Repository: [https://github.com/ZaloAI-Jaist/VMLU/](https://github.com/ZaloAI-Jaist/VMLU/)

docs/eval_data/gemma_2b_eval_benchmark.json

Lines changed: 40946 additions & 0 deletions
Large diffs are not rendered by default.
