# Benchmark

## Goal

An article typically consists of the following elements: **title**, **description**, **main points**, **categories**, and **tags**.
This fine-tuning is designed to generate all of these elements in a single request using a **custom structured format** (illustrated below). Additional objectives include:

1. Enhancing the model's ability to process and respond effectively in the **local language**.
2. Improving the structured output format for better usability.
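
For illustration only, here is a sketch of the kind of single-response structure this fine-tuning targets. The field names below are hypothetical placeholders; the authoritative template is defined by the gemma-template project:

```python
# Hypothetical shape of one structured response; the actual field names and
# layout are defined by the gemma-template project, not reproduced here.
structured_article = {
    "title": "A concise, descriptive headline",
    "description": "A one- or two-sentence summary of the article",
    "main_points": [
        "First key takeaway",
        "Second key takeaway",
    ],
    "categories": ["Technology"],
    "tags": ["gemma", "fine-tuning"],
}
```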

### Gemma 2B - Evaluation Results

The performance of the **Gemma 2B model** was assessed using **ROUGE** and **Google BLEU** metrics.

| ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Google BLEU |
|:-------:|:-------:|:-------:|:----------:|:-----------:|
|  0.722  |  0.524  |  0.456  |   0.703    |    0.345    |

**Key Observations**:

* The model shows significant improvements in:
  * Handling user-language responses.
  * Structured content generation.

* **Challenges**:
  * Incomplete feedback for certain articles.
  * Occasional duplication of keywords in responses.

For more details, see [**Version 1**](https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216121364) of this notebook. You can also download the file `gemma-benchmark/gemma_2b_eval_benchmark.json`, which is attached to this notebook.

Due to Kaggle limitations, I am currently unable to run **ROUGE** and **Google BLEU** evaluations on the refined **Gemma 2B IT** model ([**Version 2**](https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216252050)).
A demo of how I evaluate the dataset, together with the source code, is included at the end of this notebook.
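
In the meantime, here is a minimal sketch of how the two metrics can be computed with the Hugging Face `evaluate` library. The prediction/reference pairs below are placeholders; this is not the notebook's exact evaluation code:

```python
import evaluate  # pip install evaluate rouge_score

# Placeholder model outputs and gold references; the real pairs come from
# the evaluation split of the gemma-template dataset.
predictions = ["Generated title, description, main points, categories, tags ..."]
references = ["Expected title, description, main points, categories, tags ..."]

# ROUGE-1 / ROUGE-2 / ROUGE-L / ROUGE-Lsum
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=predictions, references=references)

# Google BLEU expects one list of references per prediction.
google_bleu = evaluate.load("google_bleu")
bleu_score = google_bleu.compute(
    predictions=predictions,
    references=[[ref] for ref in references],
)

print(rouge_scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
print(bleu_score)    # {'google_bleu': ...}
```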

* **Kaggle Gemma 2B Model:**
  * Model: [https://www.kaggle.com/models/bigfishdev/gemma-template-gemma-2b](https://www.kaggle.com/models/bigfishdev/gemma-template-gemma-2b)
  * Notebook Version 1: [https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216121364](https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216121364)
* **Kaggle Gemma 2B IT Model:**
  * Model: [https://www.kaggle.com/models/bigfishdev/gemma-template-gemma-2b-it](https://www.kaggle.com/models/bigfishdev/gemma-template-gemma-2b-it)
  * Notebook Version 2: [https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216252050](https://www.kaggle.com/code/bigfishdev/gemma-2b-it-fine-tuning-with-gemma-template?scriptVersionId=216252050)
* **Dataset:** [https://www.kaggle.com/datasets/bigfishdev/gemma-template](https://www.kaggle.com/datasets/bigfishdev/gemma-template)
* **Benchmark:** All benchmarks will be updated in my GitHub repo: [https://github.com/thewebscraping/gemma-template/blob/main/docs/benchmark.md](https://github.com/thewebscraping/gemma-template/blob/main/docs/benchmark.md)

### Gemma 2B - Vietnamese VMLU Evaluation Results

VMLU is a benchmark suite designed to evaluate foundational models with a focus on the **Vietnamese language**.

| ID                  |    Created At    | STEM  | Social Science | Humanities | Others |  AVG  | Unanswered |
|---------------------|:----------------:|:-----:|:--------------:|:----------:|:------:|:-----:|:----------:|
| 1624257089558187281 | 05/01/2025 17:56 | 20.14 |     29.35      |   29.84    | 25.76  | 25.61 |    1497    |

#### Results:
* Out of 9,834 attempts, 1,497 responses were unanswered (a quick way to verify this from the attached file is sketched below).
* The dataset and evaluation results can be downloaded from the file `gemma-benchmark/gemma_2b_vmlu_answers.csv`, although this benchmark is not within the scope of this fine-tuning.
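
As a quick sanity check, here is a minimal pandas sketch for inspecting the attached answers file. The `answer` column name is an assumption, not the file's documented schema:

```python
import pandas as pd

# Load the attached VMLU answers file.
df = pd.read_csv("gemma-benchmark/gemma_2b_vmlu_answers.csv")

# "answer" is an assumed column holding the model's chosen option;
# blank cells are treated as unanswered.
total = len(df)                         # 9,834 attempts in this run
unanswered = df["answer"].isna().sum()  # 1,497 in the Gemma 2B run
print(f"Unanswered: {unanswered}/{total} ({unanswered / total:.1%})")
```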

### Gemma 2B IT - Vietnamese VMLU Evaluation Results

| ID                  |    Created At    | STEM  | Social Science | Humanities | Others |  AVG  | Unanswered |
|---------------------|:----------------:|:-----:|:--------------:|:----------:|:------:|:-----:|:----------:|
| 1840435368978448913 | 06/01/2025 19:04 | 36.11 |     43.45      |   41.92    | 39.06  | 39.64 |     82     |

#### Results:
* Out of 9,834 attempts, only 82 responses were unanswered.
* The dataset and evaluation results can be downloaded from the file `gemma-benchmark/gemma_2b_it_vmlu_benchmark.csv`, although this benchmark is not within the scope of this fine-tuning.

#### My Gemma Fine-Tuning VMLU Score:



#### VMLU Leaderboard Score:
The fine-tuned Gemma 2B IT model shows a clear improvement in the VMLU rankings; its score is close to that of the **Gemma 7B IT** model. Here is a screenshot of the **VMLU Leaderboard** rankings:



#### Additional Resources:
* VMLU Website: [https://vmlu.ai/](https://vmlu.ai/)
* VMLU Leaderboard: [https://vmlu.ai/leaderboard](https://vmlu.ai/leaderboard)
* VMLU GitHub Repository: [https://github.com/ZaloAI-Jaist/VMLU/](https://github.com/ZaloAI-Jaist/VMLU/)