
Commit 16f52c6

Add FilBench blogpost (#3016)
* Add FilBench blogpost
* Add FilBench thumbnail
* Add images
* Fix embed size
* Italicize some Filipino words
* Revert changes to other lines
* Hide example image to make leaderboard main visual
* Correct the paths for image
* Check if bullet-points are fixed based on preview
* Fix bibtex citation
1 parent df35b5b commit 16f52c6

File tree

3 files changed: +155, -0 lines changed

_blog.yml

Lines changed: 15 additions & 0 deletions
@@ -6495,6 +6495,21 @@
   - trl
   - vlm
   - vision
+
+- local: filbench
+  title: "🇵🇭 FilBench - Can LLMs Understand and Generate Filipino?"
+  author: ljvmiranda921
+  thumbnail: /blog/assets/filbench/thumbnail.png
+  date: Aug 12, 2025
+  tags:
+  - open-source
+  - LLM
+  - community
+  - evaluation
+  - filipino
+  - tagalog
+  - cebuano
+  - philippines
 
 - local: accelerate-nd-parallel
   title: "Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training"

assets/filbench/thumbnail.png

189 KB

filbench.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
---
title: "🇵🇭 FilBench - Can LLMs Understand and Generate Filipino?"
thumbnail: /blog/assets/filbench/thumbnail.png
authors:
- user: ljvmiranda921
  guest: true
  org: UD-Filipino
- user: acocodes
  guest: true
  org: UD-Filipino
- user: connermanuel
  guest: true
  org: UD-Filipino
- user: jcblaise
  guest: true
  org: SEACrowd
- user: josephimperial
  guest: true
  org: SEACrowd
- user: davanstrien
  guest: false
- user: SaylorTwift
  guest: false
- user: clefourrier
  guest: false
---

# 🇵🇭 FilBench - Can LLMs Understand and Generate Filipino?

As large language models (LLMs) become increasingly integrated into our lives, it is crucial to assess whether they reflect the nuances and capabilities of specific language communities.
For example, Filipinos are among the most active ChatGPT users globally, ranking fourth in ChatGPT traffic behind the United States, India, and Brazil [[1](https://blogs.worldbank.org/en/digital-development/who-on-earth-is-using-generative-ai-)] [[2](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4715603)].
Despite this strong usage, we lack a clear understanding of how well LLMs perform in their languages, such as Tagalog and Cebuano.
Most of the existing evidence is anecdotal, such as screenshots of ChatGPT responding in Filipino offered as proof of fluency.
What we need instead is a systematic evaluation of LLM capabilities in Philippine languages.

<!-- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/filbench/filbench-intro.png" style="width: 100%;"/> -->

That's why we developed FilBench: a comprehensive evaluation suite that assesses LLM capabilities in Tagalog, Filipino (the standardized form of Tagalog), and Cebuano across fluency, linguistic and translation abilities, and culture-specific knowledge.

We used it to evaluate 20+ state-of-the-art LLMs, providing a broad assessment of their performance in Philippine languages:

<iframe
	src="https://ud-filipino-filbench-leaderboard.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>

- 📄 Paper: https://arxiv.org/abs/2508.03523
- 🖥️ GitHub: https://github.com/filbench/filbench-eval

## FilBench

The FilBench evaluation suite contains four major categories, divided into 12 tasks: Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation.
For example, the Classical NLP category includes tasks such as sentiment analysis, whereas the Generation category covers different aspects of translation.
To ensure that these categories reflect the priorities and trends in NLP research and usage, we curated them based on a historical survey of NLP research on Philippine languages from 2006 to early 2024.
(Most of these categories exclusively contain non-translated content to ensure faithfulness to the natural use of Philippine languages.)

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/filbench/filbench-main.png" style="width: 100%;"/>

- **Cultural Knowledge:** This category tests a language model's ability to recall factual and culturally specific information. For Cultural Knowledge, we curated a variety of examples that test an LLM's regional and factual knowledge (Global-MMLU), Filipino-centric values (KALAHI), and ability to disambiguate word sense (StingrayBench).
- **Classical NLP:** This category encompasses a variety of information extraction and linguistic tasks, such as named entity recognition, sentiment analysis, and text categorization, that were traditionally performed by specialized, trained models. In this category, we include instances from CebuaNER, TLUnified-NER, and Universal NER for named entity recognition, and subsets of SIB-200 and BalitaNLP for text categorization and sentiment analysis.
- **Reading Comprehension:** This category evaluates a language model's ability to understand and interpret Filipino text, focusing on tasks such as readability, comprehension, and natural language inference. For this category, we include instances from the Cebuano Readability Corpus, Belebele, and NewsPH NLI.
- **Generation:** We dedicate a large portion of FilBench to testing an LLM's capability to faithfully translate texts, either from English to Filipino or from Cebuano to English. We include a diverse set of test examples ranging from documents (NTREX-128) and realistic texts from volunteers (Tatoeba) to domain-specific text (TICO-19).

Each of these categories provides an aggregated metric.
To create a single representative score, we compute a weighted average of the category scores based on the number of examples in each category; we call this the FilBench Score.
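
In code, this aggregation is just an example-weighted mean. Here is a minimal sketch; the category scores and sizes below are illustrative placeholders, not our actual numbers:

```python
# Minimal sketch of the example-weighted average behind a FilBench-style
# score. All numbers below are illustrative placeholders, not actual
# FilBench statistics.
category_scores = {  # per-category score, in [0, 1]
    "Cultural Knowledge": 0.62,
    "Classical NLP": 0.71,
    "Reading Comprehension": 0.58,
    "Generation": 0.40,
}
category_sizes = {  # number of test examples per category
    "Cultural Knowledge": 1_000,
    "Classical NLP": 1_500,
    "Reading Comprehension": 800,
    "Generation": 1_200,
}

total_examples = sum(category_sizes.values())
filbench_score = sum(
    score * category_sizes[name] / total_examples
    for name, score in category_scores.items()
)
print(f"FilBench Score: {filbench_score:.3f}")  # weighted, not a plain mean
```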

To simplify usage and setup, we built FilBench on top of [Lighteval](https://github.com/huggingface/lighteval), an all-in-one framework for LLM evaluation.
For language-specific evaluation, we first defined translation pairs from English to Tagalog (or Cebuano) for common terms used in evaluation, such as "yes" (*oo*), "no" (*hindi*), and "true" (*totoo*), among others.
Then, we used the provided templates to implement custom tasks for the capabilities we care about.
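
To give a flavor of what this involves, here is a minimal plain-Python sketch of a translation-literal mapping and a prompt template that uses it. It illustrates the idea only; it is not Lighteval's actual template API, and the Tagalog example question is our own:

```python
# Illustrative sketch of translation literals for evaluation prompts.
# Plain Python, not Lighteval's template classes; see the FilBench
# repository for the real task definitions. Only the Tagalog pairs
# mentioned in the text are shown.
TRANSLATION_LITERALS = {
    "tgl": {"yes": "oo", "no": "hindi", "true": "totoo"},
}

def yes_no_prompt(question: str, lang: str = "tgl") -> str:
    """Render a yes/no question with answer options in the target language."""
    literals = TRANSLATION_LITERALS[lang]
    return f"{question}\nAnswer ({literals['yes']}/{literals['no']}):"

# "Is Manila the capital of the Philippines?" in Tagalog.
print(yes_no_prompt("Kabisera ba ng Pilipinas ang Maynila?"))
```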

FilBench is now available as a set of community tasks in the official Lighteval repository!

## What did we learn from FilBench?

By evaluating a wide range of LLMs on FilBench, we uncovered several insights into how they perform in Filipino.

### Finding #1: Although region-specific LLMs still lag behind GPT-4, collecting data to train these models is still a promising direction

In the past few years, we have seen an increase in region-specific LLMs that target Southeast Asian languages (SEA-specific), such as SEA-LION and SeaLLM.
These are open-weight LLMs that you can freely download from the Hugging Face Hub.
We find that SEA-specific LLMs are often the most parameter-efficient for our languages, achieving the highest FilBench scores compared to other models of their size.
However, the best SEA-specific model is still outperformed by closed-source LLMs like GPT-4o.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/filbench/filbench-finding-1.png" style="width: 100%;"/>

Building region-specific LLMs still makes sense, as we observe performance gains of 2-3% when continually fine-tuning a base LLM on SEA-specific instruction-tuning data.
This suggests that **efforts to curate Filipino/SEA-specific training data for fine-tuning remain relevant**, as they can lead to better performance on FilBench.

### Finding #2: Filipino translation is still a difficult task for LLMs

We also observe that, across the four categories of FilBench, most models struggle with Generation capabilities.
Inspecting the failure modes in Generation, we find cases where a model fails to follow translation instructions, generates overly verbose text, or hallucinates another language instead of Tagalog or Cebuano.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/filbench/filbench-finding-2.png" style="width: 100%;"/>

### Finding #3: Open LLMs Remain a Cost-Effective Choice for Filipino Language Tasks

The Philippines tends to have limited internet infrastructure and lower average incomes [[3](https://unesdoc.unesco.org/ark:/48223/pf0000393860?posInSet=1&queryId=cb72b22d-9dd3-44cd-9090-c4c89328a09c)], necessitating LLMs that are accessible, cost-efficient, and compute-efficient.
Through FilBench, we were able to identify LLMs that sit on the Pareto frontier of cost versus performance, as sketched below.
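
Identifying that frontier is simple once each model has a cost and a score: keep every model that no other model beats on both axes at once. Here is a minimal sketch with made-up (cost, score) pairs rather than our measured results:

```python
# Minimal sketch of finding the cost/performance Pareto frontier.
# The (cost, score) pairs are made-up placeholders, not FilBench results.
models = {
    "model-a": (5.00, 0.72),  # (USD per 1M tokens, benchmark score)
    "model-b": (0.40, 0.65),
    "model-c": (0.90, 0.70),
    "model-d": (0.50, 0.55),
}

def pareto_frontier(entries):
    """Keep models not dominated by a cheaper-or-equal, better-or-equal one."""
    frontier = []
    for name, (cost, score) in entries.items():
        dominated = any(
            c <= cost and s >= score and (c, s) != (cost, score)
            for c, s in entries.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))  # ['model-a', 'model-b', 'model-c']
```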

In general, we find that open-weight LLMs, i.e., models that you can freely download from the Hugging Face Hub, are considerably cheaper than commercial models without sacrificing performance.
If you want an alternative to GPT-4o for your Filipino language tasks, try Llama 4 Maverick!

<div align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/filbench/filbench-finding-3.png" style="width: 60%;"/>
</div>

We also make this information available in the Hugging Face Space of the FilBench leaderboard.

## Does your LLM work on Philippine Languages? Try it on FilBench!

We hope that FilBench provides deeper insights into LLM capabilities for Philippine languages and serves as a catalyst for advancing Filipino NLP research and development.
The FilBench evaluation suite is built on top of Hugging Face's Lighteval, allowing LLM developers to easily evaluate their models on our benchmark.
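
In practice, a single CLI call is enough to launch an evaluation run. The sketch below is hypothetical: the model id and task spec are placeholders, so please check the FilBench README for the exact task names and the Lighteval documentation for the current CLI syntax:

```python
# Hypothetical sketch of launching a FilBench task via the Lighteval CLI.
# The model id and task spec are placeholders; the exact task names and
# CLI syntax are documented at https://github.com/filbench/filbench-eval.
import subprocess

model_args = "model_name=meta-llama/Llama-3.1-8B-Instruct"  # any Hub model id
task_spec = "community|filbench:<task-name>|0|0"  # placeholder task name

subprocess.run(["lighteval", "accelerate", model_args, task_spec], check=True)
```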

For more information, please visit the links below:

- 📄 Paper: https://arxiv.org/abs/2508.03523
- 🖥️ GitHub: https://github.com/filbench/filbench-eval

## Acknowledgements

The authors would like to thank Cohere Labs for providing credits through the Cohere Research Grant to run the Aya model series, and Together AI for additional computational credits for running several open models.
We also acknowledge the Hugging Face team, particularly the OpenEvals team (Clémentine Fourrier and Nathan Habib) and Daniel van Strien, for their support in publishing this blog post.

## Citation

If you are evaluating on FilBench, please cite our work:

```bibtex
@article{filbench,
  title={Fil{B}ench: {C}an {LLM}s {U}nderstand and {G}enerate {F}ilipino?},
  author={Miranda, Lester James V. and Aco, Elyanah and Manuel, Conner and Cruz, Jan Christian Blaise and Imperial, Joseph Marvin},
  journal={arXiv preprint arXiv:2508.03523},
  year={2025}
}
```
