Korean language proficiency evaluation for LLM/SLM models using the KMMLU, CLIcK, HAE-RAE, HRM8K, KoBALT, and KorMedMCQA datasets

📋 Overview

With the continuous emergence of various LLM/SLM models, there is a need for robust evaluation datasets for non-English languages such as Korean. KMMLU (Korean Massive Multitask Language Understanding), CLIcK (Cultural and Linguistic Intelligence in Korean), HAE_RAE_BENCH 1.0, HRM8K, KoBALT, and KorMedMCQA fill this gap by providing rich, well-categorized datasets covering cultural and linguistic knowledge, mathematical reasoning, advanced linguistic phenomena, and medical knowledge, enabling detailed evaluation of Korean language models. This code performs benchmarking on these datasets with minimal time and effort.

CLIcK (Cultural and Linguistic Intelligence in Korean)

This dataset assesses Korean language proficiency in the subject areas of Korean Culture (History, Geography, Law, Politics, Society, Tradition, Economy, Pop Culture) and Korean Language (Textual, Functional, Grammar). There are a total of 1,995 samples across 11 categories. The dataset presents 4- or 5-choice multiple-choice questions; depending on the question, additional context is provided.

HAE_RAE_BENCH 1.0

This dataset evaluates Korean language proficiency in the following 6 categories: General Knowledge, History, Loan Words, Rare Words, Reading Comprehension, and Standard Nomenclature. As with CLIcK, the task is to solve multiple-choice questions, but no additional context is given. There are a total of 1,538 samples across the 6 categories.

KMMLU

The KMMLU dataset is a large-scale multi-task language understanding evaluation dataset in Korean. It is not a simple translation of the MMLU dataset, but rather data generated from Korean text, allowing us to evaluate how well LLMs/SLMs work in Korean. It consists of a total of 45 categories and 4 supercategories: STEM, Applied Science, HUMSS, and Other.

KMMLU-HARD

This dataset is an extended version of the KMMLU dataset, with more challenging questions. It is designed to further evaluate the limits of Korean NLP models and contains questions that require a particularly high level of comprehension and reasoning skills.

HRM8K

HRM8K (HAE-RAE Math 8K) is a bilingual math reasoning benchmark for Korean and English, comprising 8,011 instances. It includes Korean School Math (KSM) with 1,428 challenging problems from Korean Olympiad and competition-level exams, and Prior Sets with 6,583 problems from existing English benchmarks (GSM8K, MATH, Omni-MATH, MMMLU). This dataset evaluates mathematical reasoning capabilities in both languages.

KoBALT-700

KoBALT (Korean Benchmark for Advanced Linguistic Tasks) is designed to evaluate LLMs on Korean linguistic phenomena. It consists of 700 expert-written multiple-choice questions (A-J, 10 choices) covering 24 fine-grained linguistic phenomena across 5 core domains: Syntax (300), Semantics (215), Pragmatics (81), Phonetics/Phonology (62), and Morphology (42). Questions are categorized into 3 difficulty levels (1: easy, 2: moderate, 3: hard).

KorMedMCQA

KorMedMCQA is a Korean Medical Multiple-Choice Question Answering benchmark derived from professional healthcare licensing examinations conducted in Korea between 2012 and 2024. The dataset contains 7,469 questions from examinations for doctor (2,489 questions), nurse (1,751 questions), pharmacist (1,817 questions), and dentist (1,412 questions), covering a wide range of medical disciplines. This dataset presents 5-choice multiple choice questions with answers numbered 1-5.

🆕 What's New

  • Dec 26, 2025: Added Radar Chart Visualization for Korean LLM evaluation results with interactive charts showing performance by category/supercategory.

  • Dec 15, 2025: Added GPT-5.2 benchmark results. GPT-5.2 (medium) achieved the highest KMMLU-Hard score with 74.63% accuracy (0-shot).

  • Dec 6, 2025: Added HRM8K, KoBALT, and KorMedMCQA benchmark datasets. Added AWS Nova 2.0 Lite (launched at re:Invent 2025) KMMLU benchmark results. Notably, Nova 2.0 Lite, despite being a lightweight model, outperforms GPT-4.1 on KMMLU with 67.02% accuracy (0-shot) compared to GPT-4.1's 65.49%.

  • Nov 21, 2025: Added GPT-5.1 and GPT-5.1-chat benchmark results. GPT-5.1 now supports reasoning_effort="none", while other 5.x models introduce a new "minimal" setting, turning GPT-5.1 into a flexible spectrum of reasoning effort rather than a single fixed level of intelligence. In our benchmark, GPT-5.1 (medium) achieved the highest KMMLU score with 83.65% accuracy (0-shot), whereas GPT-5.1 with the default none setting scored approximately 20 percentage points lower, at 62.14% (0-shot).

  • Aug 11, 2025: Added GPT-5 family benchmark results. What is very impressive is the KMMLU score (0-shot 78.53% accuracy) and KMMLU-Hard score of GPT-5-mini (0-shot 61.68% accuracy). For KMMLU-Hard, many open-source models struggle to even surpass 30% accuracy.

  • Apr 17, 2025: Added GPT-4.1 family benchmark results. GPT-4.1-mini is an improvement over GPT-4o-mini and is closer to GPT-4o. GPT-4.1 outperforms GPT-4o.

  • Feb 28, 2025: Added Phi-4-mini-instruct benchmark results.

  • Feb 2, 2025: Added Phi-4 benchmark results / Added Azure AI Foundry deployment options. Phi-4 outperforms Phi-3.5-MoE in some metrics, such as CLIcK and KMMLU.

  • Aug 29, 2024: Added 5-shot experiments for the KMMLU and KMMLU-HARD benchmark datasets. For Llama-3.1-8B-Instruct, 5-shot prompting did not yield proper Korean answers. The results may vary depending on the experimental environment, but an appropriate system prompt appears to be necessary. (Please note that we did not use any system prompt.)

  • Aug 25, 2024: Added experimental results for KMMLU and KMMLU-HARD benchmark datasets. Added Phi-3-mini-128K-instruct (June version) benchmark results.

  • Aug 22, 2024: Added Phi-3.5-mini-instruct and Phi-3.5-MoE-instruct benchmark results. Phi-3.5 is Microsoft's latest open-source model that has begun to properly support multiple languages, and its Korean performance has improved greatly, as shown in the benchmark results below.

  • Aug 22, 2024: Added Llama-3.1-8B-Instruct benchmark results. Of course, a Llama-3.1 model fine-tuned on a Korean dataset may perform better, but we only compared against the vanilla model.

  • Aug 9, 2024: Added Azure OpenAI GPT-3.5-turbo (2023-06-13), GPT-4-turbo (2024-04-09), GPT-4o (2024-05-13), and GPT-4o-mini (2024-07-18) benchmark results.

βš™οΈ Implementation

The code skeleton is based on https://github.com/corca-ai/evaluating-gpt-4o-on-CLIcK, but significant improvements have been made:

  • Multi-provider support: Azure OpenAI, AWS Bedrock, OpenAI, Azure ML, Azure AI Foundry, and Hugging Face
  • Parallel processing: Efficient batch processing with configurable concurrency and chunk-based file merging
  • Robust error handling: Content filtering, rate limiting, and throttling exception handling with configurable wait times
  • Advanced parsing: Custom output parsers supporting reasoning models with multiple response format detection
  • Adaptive prompts: Reasoning-aware system prompts with configurable effort levels (none/minimal/low/medium/high)
  • Comprehensive logging: Debug mode with detailed request/response logging for troubleshooting

📊 Visualization

Radar Chart Generator

# Creating a radar chart
uv run python -c "
from utils.radar_chart_generator import RadarChartGenerator
generator = RadarChartGenerator()
generator.generate_all_charts(['CLIcK', 'KMMLU', 'HAERAE'], top_n=10)
"

# Or use a Jupyter notebook
jupyter notebook radar_chart_visualization.ipynb

Radar Charts

  • CLIcK Performance by Category
  • HAERAE Performance by Supercategory
  • KMMLU Performance by Category
  • KMMLU-Hard Performance by Category

📈 Results

Notes

The numbers in the tables below are the average accuracy (%). For Azure OpenAI models, a few questions are filtered out by the content filtering feature, but this affects only 1-5 samples in the entire dataset, so the impact is not significant.

The prompt is the same as the prompt in the CLIcK paper. The experimental results may vary depending on the system prompt, context, and parameters. The results below were obtained with max_tokens=512 and temperature=0.01, without few-shot examples, additional context, or a system prompt.

Important for HRM8K (especially OMNI_MATH subset):

  • OMNI_MATH problems are extremely complex and require significantly more tokens
  • Recommended settings: --max_tokens 5000 or higher (see the example command after this list)
  • For reasoning models (e.g., Nova with reasoning), consider using lower reasoning effort to reduce token consumption
  • If you encounter stopReason: 'max_tokens', increase the token limit further
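For reference, an HRM8K run over the OMNI_MATH subset with a raised token limit might look like the sketch below. The argument values are only suggestions; adjust them for your model and provider.

# HRM8K, OMNI_MATH subset, with a raised token limit
uv run python benchmarks/hrm8k_main.py \
    --model_provider bedrock \
    --subset OMNI_MATH \
    --max_tokens 5000 \
    --temperature 0.01 \
    --num_workers 4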

Reasoning Configuration:

  • REASONING_ENABLED=true enables reasoning mode for system prompts across all providers (see the sample configuration after this list)
  • REASONING_EFFORT levels: none (minimal reasoning), minimal (1-2 sentences), low (2-3 sentences), medium (3-4 sentences), high (4-6 sentences)
  • For Bedrock Nova models, reasoning config is also applied to the model's native reasoning capability
  • For other providers, reasoning is handled through enhanced system prompts
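As a minimal sketch (the effort level and model here are illustrative), reasoning is switched on by adding two variables to the corresponding .env file (or env/.env.<model> file) and then running a benchmark as usual:

# In the .env file of the model under test:
REASONING_ENABLED=true
REASONING_EFFORT=low

# Then run the benchmark, e.g.:
uv run python benchmarks/kmmlu_main.py --model_provider bedrock --max_tokens 1500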

Throttling Protection:

  • WAIT_TIME environment variable controls delay when throttling errors occur (default: 30 seconds)
  • Only activates on ThrottlingException, Too many requests, or similar throttling errors
  • Normal processing continues without delays

Since most of these are ChatCompletion or instruction fine-tuned models, the variation may be large compared to the results of other groups' experiments. However, our experimental results show that the relative trend holds under the same experimental conditions (e.g., GPT-4o: 70.57 / GPT-4o-mini: 60.31 in Experimental Condition 1; GPT-4o: 67.76 / GPT-4o-mini: 57.53 in Experimental Condition 2).

Model Version

Click to view model versions
  • GPT-5.2 (medium): 2025-12-11 model version (reasoning_effort="medium")
  • GPT-5.2: 2025-12-11 model version (reasoning_effort="none" as default)
  • Nova 2.0 Lite: us.amazon.nova-2-lite-v1:0 model (reasoning mode: medium)
  • GPT-5.1 (medium): 2025-11-13 model version (reasoning_effort="medium")
  • GPT-5.1: 2025-11-13 model version (reasoning_effort="none" as default)
  • GPT-5.1-chat: 2025-11-13 model version
  • GPT-5-chat: 2025-08-08 model version
  • GPT-5-mini: 2025-08-08 model version (reasoning_effort="medium" as default)
  • GPT-5-nano: 2025-08-08 model version (reasoning_effort="medium" as default)
  • GPT-4.1: 2025-04-14 model version
  • GPT-4.1-mini: 2025-04-14 model version
  • GPT-4.1-nano: 2025-04-14 model version
  • GPT-4o: 2024-05-13 model version
  • GPT-4o-mini: 2024-07-18 model version
  • GPT-4-turbo: 2024-04-09 model version
  • GPT-3.5-turbo: 2023-06-13 model version

CLIcK

Proprietary models

Accuracy by supercategory
| supercategory | GPT-5.2(medium) | GPT-5.2 | Nova 2 Pro - Preview(medium) | Nova 2 Lite(medium) | GPT-5.1(medium) | GPT-5.1 | GPT-5.1-chat | GPT-5-chat | GPT-5-mini | GPT-5-nano | GPT-4.1 | GPT-4.1-mini | GPT-4.1-nano | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Culture | 89.95 | 83.72 | 71.82 | 66.02 | 89.96 | 82.93 | 79.3 | 82.96 | 81.51 | 72.78 | 81.65 | 72.81 | 62.47 | 81.89 | 70.95 | 73.61 | 53.38 |
| Language | 92.92 | 80.77 | 75.69 | 69.69 | 89.45 | 76.27 | 74.76 | 75.71 | 83.8 | 70.24 | 78.31 | 70.62 | 58.62 | 77.54 | 63.54 | 71.23 | 46 |
| Overall | 90.92 | 82.75 | 73.08 | 67.22 | 89.81 | 81 | 77.97 | 80.84 | 82.17 | 72.05 | 80.55 | 72.09 | 61.21 | 80.46 | 68.5 | 72.82 | 50.98 |
Click to view Accuracy by category
Accuracy by category
supercategory category GPT-5.2(medium) GPT-5.2 Nova 2 Pro - Preview(medium) Nova 2 Lite(medium) GPT-5.1(medium) GPT-5.1 GPT-5.1-chat GPT-5-chat GPT-5-mini GPT-5-nano GPT-4.1 GPT-4.1-mini GPT-4.1-nano GPT-4o GPT-4o-mini GPT-4-turbo GPT-3.5-turbo
Culture Economy 94.92 94.92 88.14 88.14 96.61 81.36 94.92 91.53 89.83 88.14 94.92 84.75 86.44 94.92 83.05 89.83 64.41
Culture Geography 93.13 88.55 77.86 69.47 91.6 81.68 80.15 83.97 88.55 79.39 80.92 78.63 64.89 80.15 77.86 82.44 53.44
Culture History 84.7 67.16 46.43 39.64 81.5 68.7 60.45 73.64 70.77 48.8 68.52 51.11 38.89 66.92 48.4 46.4 31.79
Culture Law 85.39 75.34 58.45 52.05 85.84 80.37 70.32 67.58 67.12 54.34 71.23 58.45 48.4 70.78 57.53 61.19 41.55
Culture Politics 90.48 88.1 85.71 80.95 91.67 89.29 84.52 91.67 84.52 85.71 89.29 82.14 75 88.1 83.33 89.29 65.48
Culture Pop Culture 100 97.56 80.49 80.49 100 95.12 95.12 97.56 92.68 85.37 100 87.8 85.37 97.56 85.37 92.68 75.61
Culture Society 94.17 91.91 89 84.14 93.2 89.64 89 91.26 90.94 88.35 91.26 86.73 77.67 92.88 85.44 86.73 71.2
Culture Tradition 89.64 90.54 78.38 71.62 91.89 87.39 83.78 86.94 85.59 81.08 85.14 81.08 67.12 87.39 74.77 79.28 55.86
Language Functional 94.4 90.4 81.6 74.4 92.86 78.57 85.71 85.71 92.86 78.57 87.2 73.6 53.6 84.8 64.8 80 40
Language Grammar 87.92 62.5 58.33 47.5 84.05 57.76 55.17 58.62 73.71 56.9 60 50 40.42 57.08 42.5 47.5 30
Language Textual 96.49 91.93 87.72 86.32 93.68 91.23 90.18 89.12 91.58 80.7 89.82 86.67 76.14 91.58 80.7 87.37 62.11

Open-Source models

Accuracy by supercategory
| supercategory | gpt-oss-120b | Phi-4-mini-instruct | Phi-4 | Phi-3.5-mini-instruct | Phi-3.5-MoE-instruct | Phi-3-mini-128k-instruct-June | Llama-3.1-8B-Instruct |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Culture | 60.28 | 43.05 | 57.84 | 43.77 | 58.44 | 29.74 | 51.15 |
| Language | 39.17 | 42.31 | 61.85 | 41.38 | 52.31 | 27.85 | 40.92 |
| Overall | 53.26 | 42.81 | 59.15 | 42.99 | 56.44 | 29.12 | 47.82 |
Click to view Accuracy by category
Accuracy by category
supercategory category gpt-oss-120b Phi-4-mini-instruct Phi-4 Phi-3.5-mini-instruct Phi-3.5-MoE-instruct Phi-3-mini-128k-instruct-June Llama-3.1-8B-Instruct
Culture Economy 91.53 59.32 76.27 61.02 77.97 28.81 66.1
Culture Geography - 39.69 55.73 45.8 60.31 29.01 54.2
Culture History 20.09 27.5 27.5 26.15 33.93 30 29.64
Culture Law 72.62 36.99 50.23 32.42 52.51 22.83 44.29
Culture Politics 80.49 52.38 76.19 54.76 70.24 33.33 59.52
Culture Pop Culture 78.96 63.41 70.73 60.98 80.49 34.15 60.98
Culture Society 57.21 52.75 75.08 54.37 74.43 31.72 65.05
Culture Tradition 57.21 45.5 66.67 47.75 58.11 31.98 54.95
Language Functional 57.14 42.4 66.4 37.6 48 24 32.8
Language Grammar 16.38 28.33 40 27.5 29.58 23.33 22.92
Language Textual 56.84 54.04 78.25 54.74 73.33 33.33 59.65

HAE_RAE_BENCH 1.0

Proprietary models

| category | GPT-5.2(medium) | GPT-5.2 | Nova 2 Lite(medium) | GPT-5.1(medium) | GPT-5.1 | GPT-5.1-chat | GPT-5-chat | GPT-5-mini | GPT-5-nano | GPT-4.1 | GPT-4.1-mini | GPT-4.1-nano | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General Knowledge | 89.2 | 81.82 | 53.98 | 90.34 | 75.57 | 71.59 | 69.28 | 82.39 | 72.16 | 75.57 | 52.27 | 43.18 | 77.27 | 53.41 | 66.48 | 40.91 |
| History | 98.94 | 94.15 | 75 | 96.62 | 91.07 | 91.77 | 89.86 | 91.14 | 85.96 | 93.62 | 89.89 | 64.89 | 92.02 | 84.57 | 78.72 | 30.32 |
| Loan Words | 91.12 | 80.47 | 78.11 | 89.35 | 79.88 | 75.74 | 82.39 | 86.98 | 86.98 | 79.29 | 73.96 | 73.37 | 79.88 | 76.33 | 78.11 | 59.17 |
| Rare Words | 88.89 | 87.9 | 68.64 | 88.15 | 87.16 | 85.68 | 86.84 | 78.02 | 75.31 | 88.15 | 83.7 | 76.54 | 87.9 | 81.98 | 79.01 | 61.23 |
| Reading Comprehension | 91.28 | 87.02 | 82.1 | 90.16 | 85.23 | 84.79 | 86.27 | 87.7 | 77.18 | 86.35 | 82.55 | 67.11 | 85.46 | 77.18 | 80.09 | 56.15 |
| Standard Nomenclature | 96.08 | 88.24 | 81.05 | 94.77 | 83.66 | 84.97 | 85.62 | 83.66 | 81.05 | 87.58 | 78.43 | 73.86 | 88.89 | 75.82 | 79.08 | 53.59 |
| Overall | 91.81 | 86.93 | 73.93 | 90.65 | 84.52 | 83.22 | 84.32 | 84.35 | 78.6 | 85.83 | 78.93 | 67.95 | 85.7 | 76.4 | 77.76 | 52.67 |

Open-Source models

| category | gpt-oss-120b | Phi-4-mini-instruct | Phi-4 | Phi-3.5-MoE-instruct | Phi-3.5-mini-instruct | Phi-3-mini-128k-instruct-June | Llama-3.1-8B-Instruct |
| --- | --- | --- | --- | --- | --- | --- | --- |
| General Knowledge | 16.48 | 28.41 | 38.07 | 39.77 | 31.25 | 28.41 | 34.66 |
| History | 48.91 | 31.38 | 37.77 | 60.64 | 32.45 | 22.34 | 44.15 |
| Loan Words | 31.95 | 52.66 | 61.54 | 70.41 | 47.93 | 35.5 | 63.31 |
| Rare Words | 43.95 | 51.6 | 62.72 | 63.95 | 55.06 | 42.96 | 63.21 |
| Reading Comprehension | 34.9 | 43.18 | 71.14 | 64.43 | 42.95 | 41.16 | 51.9 |
| Standard Nomenclature | 56.21 | 54.9 | 63.4 | 66.01 | 44.44 | 32.68 | 58.82 |
| Overall | 38.66 | 44.47 | 59.23 | 61.83 | 44.21 | 36.41 | 53.9 |

KMMLU (0-shot)

Proprietary models

Accuracy by supercategory
| supercategory | GPT-5.2 | Nova 2.0 Lite | GPT-5.1 | GPT-5.1(medium) | GPT-5.1-chat | GPT-5-chat | GPT-5-mini | GPT-5-nano | GPT-4.1 | GPT-4.1-mini | GPT-4.1-nano | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Applied Science | 68.92 | 68.06 | 62.14 | 83.65 | 59.96 | 62.64 | 76.94 | 69.97 | 61.98 | 56.21 | 45.84 | 61.52 | 49.29 | 55.98 | 38.47 |
| HUMSS | 75.79 | 63.24 | 72.32 | 83.7 | 68.18 | 70.78 | 75.21 | 66.44 | 71.7 | 64.11 | 51.56 | 69.45 | 56.59 | 63 | 40.9 |
| Other | 70.62 | 63.32 | 66.2 | 80.96 | 63.15 | 65.81 | 73.12 | 65.3 | 65.29 | 58.26 | 47.93 | 63.79 | 52.35 | 57.53 | 40.19 |
| STEM | 73.16 | 70.9 | 66.62 | 86.18 | 64.04 | 67.27 | 79.32 | 73.33 | 66.56 | 61.17 | 50.78 | 65.16 | 54.74 | 60.84 | 42.24 |
| Overall | 71.54 | 67.02 | 65.9 | 83.73 | 63.11 | 65.92 | 76.47 | 69.28 | 65.49 | 59.26 | 48.57 | 64.26 | 52.63 | 58.75 | 40.3 |
Click to view Accuracy by category
Accuracy by category
supercategory category GPT-5.2 Nova 2.0 Lite GPT-5.1 GPT-5.1(medium) GPT-5.1-chat GPT-5-chat GPT-5-mini GPT-5-nano GPT-4.1 GPT-4.1-mini GPT-4.1-nano GPT-4o GPT-4o-mini GPT-4-turbo GPT-3.5-turbo
Applied Science Aviation-Engineering-and-Maintenance 77.41 70.8 71.01 87.55 67.88 71.6 79.28 72.12 69.8 61.4 47.2 69.8 50.4 61.92 39
Applied Science Electronics-Engineering 81.2 86.7 72.7 94.2 70.6 74.8 91.6 87.3 73.3 69.6 58.1 73.2 64.5 70.2 47.1
Applied Science Energy-Management 58.67 63.6 54.06 82 51.47 54.18 72.96 66.46 53.7 45.4 36.3 50.3 39.5 43.6 33.9
Applied Science Environmental-Science 54.32 62.4 45.91 75.53 42.35 46.56 70.23 61.96 47.8 41.9 35.8 45.82 36.5 42.7 30.51
Applied Science Gas-Technology-and-Engineering 59.88 66.1 52.7 78.03 52.53 51.59 71.69 65.16 51.2 48.9 39.9 50.7 44.6 48.3 35
Applied Science Geomatics 60.2 60.1 52.1 79.2 51.5 51.9 71 62.2 54.9 49.3 42.6 56.4 43.7 47.3 35.7
Applied Science Industrial-Engineer 70.2 63 63.2 79 59.9 62 74 64.6 62 55.6 48.1 61.7 53.1 57 41.12
Applied Science Machine-Design-and-Manufacturing 76.4 72.9 68.3 88.8 65.4 67.8 81.6 72.6 67.7 60.8 45.9 68.4 53 61 37.9
Applied Science Maritime-Engineering 73.5 62.5 68.83 87.33 66.33 70.33 79.33 72.5 68.67 60.67 44.5 67 52.5 60.17 42.67
Applied Science Nondestructive-Testing 69.98 61.5 64.9 76.84 62.14 64.75 69.79 66.84 63.9 59.9 48.1 65.2 48 59.7 35.8
Applied Science Railway-and-Automotive-Engineering 63.55 66.1 54.95 83.65 50.51 57.45 77.11 69.79 55.9 50.3 42.9 53.9 40.9 48.6 32.5
Applied Science Telecommunications-and-Wireless-Technology 82.5 78.8 76.5 89.6 77.1 78.6 83.8 78.1 77.6 72.5 60.1 77.8 66.1 73 52.24
HUMSS Accounting 80 62 77 97 62 71 92 78 71 64 38 72 48 59 30
HUMSS Criminal-Law 56.5 35 52.5 63 47 47.5 48 45 54 43 38.5 52.5 37 46.5 40.5
HUMSS Economics 81.54 66.92 80 86.15 71.54 78.46 83.85 74.62 77.69 74.62 56.92 74.62 64.62 65.38 43.85
HUMSS Education 89 77 88 93 80 85 89 77 86 80 59 83 70 69 43
HUMSS Korean-History 69 29 55.56 82 52.22 55.56 65.56 42.22 53 39 33 56 32 41 26
HUMSS Law 69 51.2 67.5 75.6 63.7 65.9 63.84 53.43 65.9 59.1 45.1 65.8 50.6 56.7 38.23
HUMSS Management 81.4 70.4 77.4 89.5 74 76 82.9 73.7 78.3 69.2 59 74.2 64.2 70.7 47.9
HUMSS Political-Science-and-Sociology 82 68.33 80 88.67 77.33 80.33 79.67 68 81 68.33 56 79.67 62.33 72.67 46.67
HUMSS Psychology 74.3 57.6 71.3 81.1 64.8 67.8 70.5 61.2 68.3 59 45.6 66.4 50.4 57.3 32.8
HUMSS Social-Welfare 82.5 84.2 76.9 92.4 76.5 78.4 89.9 86.1 78.3 76 62.8 74.5 67.7 73.2 46.4
HUMSS Taxation 56.5 40 54 66 46.5 49.5 47 36 54 42.5 35.5 51 39.5 44 33.5
Other Agricultural-Sciences 63.86 52.5 57.68 71.63 51.82 56.26 61.86 54.3 57 47.8 37.4 53.9 41.8 47.58 30.1
Other Construction 61.44 55.3 52.89 73.94 50.82 53.64 63.67 55.4 52.2 47.4 38.9 50.7 41.8 46 33.2
Other Fashion 71.59 55.2 67.76 76.84 63.57 67.07 69.18 60.91 67.7 57.9 48.4 65.8 51.1 58.28 39.7
Other Food-Processing 70.7 61 67.4 79.9 64.95 66.87 74.24 66.5 66.5 59.6 45.3 64.9 51.7 57.4 37.3
Other Health 82 79 80 92 86 82 83 78 84 74 63 80 69 70 50
Other Interior-Architecture-and-Design 81.2 72.8 79.2 89.3 77.4 79.3 81.2 72.7 79.6 71.4 57.7 77.6 64 71.1 48.2
Other Marketing 91.4 83.3 89.9 93 88.1 88.4 88.8 84.6 90.3 87.6 78.6 88.9 83.1 86.26 63.6
Other Patent 61 35 61 65 46 60 52 29 57 51 45 52 33 46 40
Other Public-Safety 60.66 57.1 57.73 76 52.05 56.52 64.86 57.21 53.4 46 39.6 53.2 44.3 48.2 36.2
Other Real-Estate 66.5 51 65 71.5 59.5 61.5 66 57 62.5 54.5 48.5 64.5 49 56.5 38
Other Refrigerating-Machinery 64.36 73.1 54.64 88.19 54.02 56.84 81.06 73.6 55.1 48.3 36.2 54.7 41.9 45.7 32.7
STEM Biology 73.19 54.2 66.73 84.49 61.96 67.76 73.33 63.23 66.4 54.6 38.7 63.1 45.5 52.83 31.84
STEM Chemical-Engineering 73.16 83.8 64.09 90.89 64.02 66.17 86.91 82.45 67.7 61.4 50.4 65.1 55.8 62.6 38.5
STEM Chemistry 76.51 83.67 73.93 92.5 69.81 71.79 89.82 86.67 71.5 64.33 54.83 68.67 57.33 64.33 39.67
STEM Civil-Engineering 62.9 60.7 54.5 79.6 51 56.6 70.5 65 52.9 50.4 42.7 54.3 46.9 50.3 38.3
STEM Computer-Science 91.2 84.7 88.9 94.7 87.5 88.9 91 86.6 89.2 85 76.2 88.2 80.1 84.8 65.7
STEM Ecology 71.9 60.4 66.7 77.5 61.9 65.1 69.1 61.6 64.5 58.8 53.3 64.4 54.3 60 42.9
STEM Electrical-Engineering 53.82 57.7 45.8 74 43.2 48.3 64.1 57.4 46.2 43.9 35.8 46 38.5 44.1 31.7
STEM Information-Technology 88.3 88.4 85.3 94.9 84.3 85.3 91.6 88.2 86 82.6 72.2 83.6 78.4 81.2 63.78
STEM Materials-Engineering 80.95 66.9 74.33 87.77 71.65 73.47 78.18 70.6 72.4 64.6 47.9 69.4 52.1 64.55 39.69
STEM Math 36.33 82.67 30 93 26.67 32.67 91.33 89 32 35.33 28.33 32.67 30 28.67 26.33
STEM Mechanical-Engineering 72.1 70.1 60.1 86.98 59.08 61.63 80.21 73.06 61.1 55.1 44.1 60 46.9 54.7 34.2

Open-Source models

| supercategory | Phi-4-mini | Phi-4 | Phi-3.5-MoE-instruct | Phi-3.5-mini-instruct | Phi-3-mini-128k-instruct-June | Llama-3.1-8B-Instruct |
| --- | --- | --- | --- | --- | --- | --- |
| Applied Science | 36.46 | 47.97 | 45.15 | 35.8 | 31.68 | 37.03 |
| HUMSS | 35.52 | 54.27 | 49.75 | 31.56 | 26.47 | 37.29 |
| Other | 36.26 | 49.07 | 47.24 | 35.45 | 31.01 | 39.15 |
| STEM | 39.31 | 52 | 49.08 | 38.54 | 31.9 | 40.42 |
| Overall | 37.08 | 50.3 | 47.43 | 35.87 | 30.82 | 38.54 |

KMMLU-HARD (0-shot)

Proprietary models

Accuracy by supercategory
| supercategory | GPT-5.2(medium) | GPT-5.2 | Nova 2 Pro - Preview(medium) | Nova 2 Lite(medium) | GPT-5.1 | GPT-5.1(medium) | GPT-5.1-chat | GPT-5-chat | GPT-5-mini | GPT-5-nano | GPT-4.1 | GPT-4.1-mini | GPT-4.1-nano | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Applied Science | 76.94 | 50.76 | 59.08 | 53.91 | 39.32 | 74.24 | 36.87 | 38.97 | 64.7 | 54.96 | 39.08 | 34.92 | 23.75 | 37.12 | 22.25 | 29.17 | 21.07 |
| HUMSS | 71.11 | 52.91 | 43.4 | 37.17 | 49.82 | 71.7 | 44.11 | 45.07 | 54.1 | 45.18 | 44.11 | 37.57 | 22.95 | 41.97 | 23.31 | 31.51 | 19.44 |
| Other | 68.33 | 48.81 | 46 | 41.81 | 44.96 | 66.36 | 41.85 | 43.51 | 53.69 | 43.82 | 44.34 | 33.64 | 24.82 | 40.39 | 26.48 | 29.59 | 22.22 |
| STEM | 80.42 | 52.11 | 63.82 | 56.18 | 43.33 | 77.88 | 41.6 | 41.98 | 67.48 | 60.37 | 44.45 | 36.55 | 25.64 | 39.82 | 26.36 | 32.18 | 20.91 |
| Overall | 74.63 | 51.1 | 54.07 | 48.2 | 43.9 | 72.76 | 40.83 | 42.12 | 60.61 | 51.72 | 42.79 | 35.6 | 24.34 | 39.62 | 24.56 | 30.56 | 20.97 |
Click to view Accuracy by category
Accuracy by category
supercategory category GPT-5.2(medium) GPT-5.2 Nova 2 Pro - Preview(medium) Nova 2 Lite(medium) GPT-5.1 GPT-5.1(medium) GPT-5.1-chat GPT-5-chat GPT-5-mini GPT-5-nano GPT-4.1 GPT-4.1-mini GPT-4.1-nano GPT-4o GPT-4o-mini GPT-4-turbo GPT-3.5-turbo
Applied Science Aviation-Engineering-and-Maintenance 81 57 63 52 41 77 43 37 69 57 39 40 22 45 25 32 25
Applied Science Electronics-Engineering 92 60 81 80 45 92 46 48 91 81 47 49 26 42 29 41 22
Applied Science Energy-Management 79 44 62 57 33 77 27 34 72 63 40 36 20 40 27 37 22.45
Applied Science Environmental-Science 71.74 35.87 61 54 31.11 66.25 24.29 27.5 68.57 56 32 21 19 27.5 20 24 31.25
Applied Science Gas-Technology-and-Engineering 78.12 54.17 52 55 33.33 71 38.75 32.5 60 58.89 30 30 23 35 23 28 13
Applied Science Geomatics 84 48 55 51 32 78 30 36 66 47 44 35 27 34 22 24 22
Applied Science Industrial-Engineer 54 34 47 29.73 37 54 33 34 53 38 31 25 20 32 19 26 25
Applied Science Machine-Design-and-Manufacturing 84 57 64 53 53 84 39 44 70 54 46 36 24 45 17 27 16.25
Applied Science Maritime-Engineering 80 51 59 43 34 73 38 41 54 50 38 34 25 32 19 26 21
Applied Science Nondestructive-Testing 68 55 47 38.46 38 62 36 43 47 40 34 36 27 36 23 23 14
Applied Science Railway-and-Automotive-Engineering 75 51 56 58 45 77 40 36 61 60 37 37 22 32 17 24 18
Applied Science Telecommunications-and-Wireless-Technology 76 61 62 56 48 78 44 51 66 55 51 40 30 43 26 38 24
HUMSS Accounting 95.65 73.91 65.22 54.35 58.7 91.3 45.65 41.3 82.61 73.91 50 45.65 13.04 50 21.74 21.74 13.04
HUMSS Criminal-Law 48 40 26 25 38 49 27 29 33 29 32 28 19 31 16 22 27
HUMSS Economics 73.81 66.67 50 45.24 73.81 88.1 47.62 61.9 57.14 52.38 57.14 50 30.95 52.38 26.19 38.1 26.19
HUMSS Education 78.26 65.22 52.17 39.13 73.91 82.61 60.87 73.91 60.87 60.87 60.87 56.52 30.43 56.52 43.48 34.78 17.39
HUMSS Korean-History 84.09 52.27 27.27 13.64 50 81.82 40.91 47.73 56.82 27.27 36.36 20.45 27.27 40.91 18.18 27.27 18.18
HUMSS Law 62 50 26 20 45 60 39 43 43 30 38 37 21 40 20 28 15.85
HUMSS Management 74 54 58 53 59 79 53 48 63 58 48 41 30 42 26 38 22
HUMSS Political-Science-and-Sociology 74.44 57.78 43.33 40 57.78 77.78 53.33 60 63.33 48.89 57.78 32.22 21.11 53.33 27.78 38.89 23.33
HUMSS Psychology 77 57 47 34.85 52 76 47 45 51 40 46 42 18 44 20 32 7
HUMSS Social-Welfare 83 55 72 63 43 81 54 52 78 74 48 54 36 45 32 47 24
HUMSS Taxation 59.38 38.54 22.92 21.88 34.38 56.25 31.25 26.04 30.21 23.96 31.25 21.88 12.5 28.12 18.75 17.71 17.71
Other Agricultural-Sciences 67 52 51 39 51 66 33 44 52 38 49 32 26 38 23 28 17
Other Construction 71 45 49 45 42 66 37 38 53 46 35 26 24 35 28 27 23
Other Fashion 53 37 32 24 40 51 39 39 42 28 41 27 18 31 23 20 29
Other Food-Processing 69 50 46 41 47 66 45 43 60 45 42 40 19 38 21 26 19
Other Health 86.96 52.17 43.48 34.78 34.78 78.26 56.52 47.83 60.87 43.48 52.17 43.48 8.7 39.13 34.78 34.78 17.39
Other Interior-Architecture-and-Design 79 61 50 44 52 76 53 55 55 48 53 32 28 49 29 36.67 25
Other Marketing 66 56 52 51 54 70 47 50 57 56 57 54 33 54 39 46 22
Other Patent 66.67 49.02 9.8 23.53 45.1 62.75 47.06 54.9 37.25 19.61 43.14 31.37 41.18 47.06 25.49 25.49 37.25
Other Public-Safety 67 43 46 42.7 41 60 31 30 44 40 35 20 17 31 19 25 17
Other Real-Estate 52.81 40.45 35.96 37.08 38.2 55.06 40.45 39.33 49.44 33.71 38.2 32.58 29.21 39.33 20.22 32.58 19.1
Other Refrigerating-Machinery 85 53 70 63 41 85 45 46 77 71 47 38 25 45 34 27 22
STEM Biology 81 60 52 39 47 77 48 48 65 55 47 33 12 46 27 26 13
STEM Chemical-Engineering 86.96 45.65 79 81 40 85 38.75 37.5 81.25 75.56 39 35 26 38 21 34 20
STEM Chemistry 91 66 80 71 57.78 92.5 52.22 56.67 80 73.33 60 47 30 50 35 45 27
STEM Civil-Engineering 69 47 56 41 30 66 38 37 50 44 41 41 28 38 23 29 13
STEM Computer-Science 80 61 63 52 56 80 47 53 70 58 51 43 27 47 31 36 26
STEM Ecology 61 49 29 28 36 62 35 37 43 35 38 33 25 39 26 25 19
STEM Electrical-Engineering 75 40 55 50 33 71 36 33 54 51 38 31 15 34 19 24 16
STEM Information-Technology 82 62 65 67 59 80 47 52 71 64 58 49 34 43 35 44 28
STEM Materials-Engineering 80.21 60.42 64 46 54 76 48.89 50 65 53.33 50 32 28 43 25 33 24
STEM Math 92 25 89 86 26 94 31 22 92 91 26 22 25 23 26 22 22
STEM Mechanical-Engineering 87 57 70 57 39 79 37 37 75 66 41 36 32 37 22 36 22

Open-Source models

| supercategory | Phi-4-mini-instruct | Phi-4 | Phi-3.5-MoE-instruct | Phi-3.5-mini-instruct | Phi-3-mini-128k-instruct-June | Llama-3.1-8B-Instruct |
| --- | --- | --- | --- | --- | --- | --- |
| Applied Science | 26.02 | 24 | 25.83 | 27.08 | 26.17 | 26.25 |
| HUMSS | 23.09 | 22.88 | 21.52 | 20.21 | 24.38 | 20.21 |
| Other | 23.19 | 22.73 | 24.82 | 23.05 | 24.82 | 23.88 |
| STEM | 22.25 | 24.25 | 28.18 | 24.36 | 26.91 | 24.64 |
| Overall | 25.27 | 24.24 | 25.34 | 24 | 25.68 | 24.03 |

🚀 Quick Start

GitHub Codespace

Please start a new project by connecting to a GitHub Codespaces project. The environment required for the hands-on is configured automatically through the devcontainer, so you only need to run a Jupyter notebook.

Your Local PC

Please start by installing the required packages on your local PC with uv:

uv sync

To use Jupyter Notebook in VS Code or Cursor, choose one of the two methods below:

  1. Click Select Kernel - Python Environments, and then select .venv.

  2. Click Python: Select Interpreter with the shortcut Ctrl + Shift + P (Windows/Linux) or Cmd + Shift + P (macOS), and then select .venv.

Configuration

Multiple Models Setup

For testing multiple models simultaneously, create separate configuration files in the env/ folder:

mkdir env
cp .env.sample env/.env.gpt4
cp .env.sample env/.env.claude
cp .env.sample env/.env.nova

Each file should have different model configurations:

env/.env.gpt4:

MODEL_NAME=gpt-4o
MODEL_VERSION=2024-05-13
AZURE_OPENAI_ENDPOINT=<YOUR_ENDPOINT>
AZURE_OPENAI_API_KEY=<YOUR_API_KEY>

env/.env.claude:

MODEL_NAME=claude-4-5
MODEL_VERSION=2025-12-05
BEDROCK_MODEL_ID=anthropic.claude-3-5-sonnet-20241022-v2:0
AWS_REGION=us-west-2

env/.env.nova:

MODEL_NAME=nova-2-lite
MODEL_VERSION=2025-12-05
BEDROCK_MODEL_ID=us.amazon.nova-2-lite-v1:0
AWS_REGION=us-west-2

The ./run.sh script will automatically detect all .env files in the env/ folder and run evaluations in parallel.

Configuration Options

# Basic info
MODEL_NAME=<YOUR_MODEL_NAME>
MODEL_VERSION=<YOUR_MODEL_VERSION>

# Reasoning Configuration (applies to all providers)
REASONING_ENABLED=true  # Enable reasoning mode for system prompts
REASONING_EFFORT=medium  # none, minimal, low, medium, high

# Wait time between requests (seconds) - helps avoid throttling
WAIT_TIME=30  # Default: 30 seconds, only used when throttling errors occur

Azure OpenAI

AZURE_OPENAI_ENDPOINT=<YOUR_ENDPOINT>
AZURE_OPENAI_API_KEY=<YOUR_API_KEY>
AZURE_OPENAI_DEPLOYMENT_NAME=<YOUR_DEPLOYMENT_NAME>
AZURE_OPENAI_API_VERSION=2025-04-01-preview

Azure AI Foundry

AZURE_AI_INFERENCE_KEY=<YOUR_API_KEY>
AZURE_AI_INFERENCE_ENDPOINT=<YOUR_ENDPOINT>
AZURE_AI_DEPLOYMENT_NAME=Phi-4

Amazon Bedrock

BEDROCK_MODEL_ID=us.amazon.nova-2-lite-v1:0
AWS_REGION=us-west-2

OpenAI

OPENAI_API_KEY=<YOUR_API_KEY>
OPENAI_DEPLOYMENT_NAME=<YOUR_DEPLOYMENT_NAME>

Azure ML

AZURE_ML_DEPLOYMENT_NAME=<YOUR_DEPLOYMENT_NAME>
AZURE_ML_ENDPOINT_URL=<YOUR_ENDPOINT_URL>
AZURE_ML_ENDPOINT_TYPE=<dedicated or serverless>
AZURE_ML_API_KEY=<YOUR_API_KEY>

Hugging Face

HF_API_TOKEN=<YOUR_HF_API_TOKEN>

Running Evaluations

Interactive Mode (Recommended)

Run the interactive script that guides you through the evaluation process:

./run.sh

The script will ask you to:

  1. Select dataset (CLIcK, HAE-RAE, KMMLU, KMMLU-HARD, HRM8K, KoBALT, KorMedMCQA)
  2. Choose debug mode (y/n)
  3. Set batch size (default: 10)
  4. Set max tokens (default: 1500)
  5. Set temperature (default: 0.01)
  6. Set number of workers (default: 4)

Manual Mode

Run individual benchmarks directly:

# Example: CLIcK benchmark
uv run python benchmarks/click_main.py \
    --model_provider bedrock \
    --batch_size 4 \
    --is_debug true \
    --num_debug_samples 20 \
    --num_workers 5 \
    --max_tokens 512 \
    --temperature 0.01

# Available benchmark scripts:
# benchmarks/click_main.py
# benchmarks/haerae_main.py  
# benchmarks/kmmlu_main.py
# benchmarks/hrm8k_main.py
# benchmarks/kobalt_main.py
# benchmarks/kormedmcqa_main.py
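
# The other benchmarks follow the same pattern. For example, a KorMedMCQA run
# against an Azure OpenAI deployment (argument values are illustrative) could be:
uv run python benchmarks/kormedmcqa_main.py \
    --model_provider azureopenai \
    --batch_size 10 \
    --max_tokens 512 \
    --temperature 0.01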

Available Parameters

--is_debug              # Enable debug mode (default: True)
--num_debug_samples     # Number of samples in debug mode (default: 20)
--model_provider        # Provider: azureopenai, bedrock, openai, azureml, azureaifoundry, huggingface
--batch_size            # Batch size for processing (default: 10)
--max_retries           # Maximum retry attempts (default: 3)
--max_tokens            # Maximum tokens in response (default: 256)
--temperature           # Sampling temperature (default: 0.01)
--template_type         # Prompt template type (default: basic)
--wait_time             # Wait time between requests (default: 1.0)
--num_workers           # Number of parallel workers for processing (default: 4)
--num_shots             # Number of few-shot examples for KMMLU (default: 0)
--is_hard               # Use KMMLU-HARD dataset (default: False)
--subset                # HRM8K subset: GSM8K, MATH, OMNI_MATH, MMMLU, KSM (default: GSM8K)
--categories            # Specific categories to evaluate (optional)
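
# Example sketch: combining parameters for a 5-shot KMMLU-HARD run.
# Boolean flags follow the same "--flag true" form as --is_debug above;
# the values here are illustrative.
uv run python benchmarks/kmmlu_main.py \
    --model_provider azureopenai \
    --is_hard true \
    --num_shots 5 \
    --max_tokens 512 \
    --temperature 0.01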

Output

Evaluation results are saved in:

  • ./results/ - Detailed CSV results for each model and dataset
  • ./evals/ - Aggregated evaluation metrics

📚 References

Expand...
@misc{kim2024click,
      title={CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean}, 
      author={Eunsu Kim and Juyoung Suk and Philhoon Oh and Haneul Yoo and James Thorne and Alice Oh},
      year={2024},
      eprint={2403.06412},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2403.06412}, 
}

@misc{son2024haeraebenchevaluationkorean,
      title={HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models}, 
      author={Guijin Son and Hanwool Lee and Suwan Kim and Huiseo Kim and Jaecheol Lee and Je Won Yeom and Jihyu Jung and Jung Woo Kim and Songseong Kim},
      year={2024},
      eprint={2309.02706},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2309.02706}, 
}

@misc{son2024kmmlumeasuringmassivemultitask,
      title={KMMLU: Measuring Massive Multitask Language Understanding in Korean}, 
      author={Guijin Son and Hanwool Lee and Sungdong Kim and Seungone Kim and Niklas Muennighoff and Taekyoon Choi and Cheonbok Park and Kang Min Yoo and Stella Biderman},
      year={2024},
      eprint={2402.11548},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.11548}, 
}

@misc{ko2025hrm8k,
      title={HRM8K: A Bilingual Math Reasoning Benchmark for Korean and English}, 
      author={Hyunwoo Ko and Guijin Son and Dasol Choi},
      year={2025},
      eprint={2501.02448},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.02448}, 
}

@misc{lee2025kobalt,
      title={KoBALT: A Benchmark for Evaluating Korean Linguistic Phenomena in Large Language Models}, 
      author={Dohyun Lee and Seunghyun Hwang and Seungtaek Choi and Hwisang Jeon and Sohyun Park and Sungjoon Park and Yungi Kim},
      year={2025},
      eprint={2505.16125},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.16125}, 
}

@misc{kweon2024kormedmcqa,
      title={KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations}, 
      author={Sunjun Kweon and Jiyoun Kim and Sujeong Im and Eunbyeol Cho and Seongsu Bae and Jungwoo Oh and Gyubok Lee and Jong Hak Moon and Seng Chan You and Seungjin Baek and Chang Hoon Han and Yoon Bin Jung and Yohan Jo and Edward Choi},
      year={2024},
      eprint={2403.01469},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2403.01469}, 
}
