
feat/Add CL-bench (tencent/CL-bench) benchmark #1191

Merged
Yunnglin merged 5 commits into modelscope:main from XChen-Zero:feat/add_clbench
Feb 7, 2026

Conversation

@XChen-Zero
Contributor

This PR adds CL-bench support to EvalScope: it registers a cl_bench benchmark, loads tencent/CL-bench from HuggingFace, runs inference on the dataset's OpenAI-style messages, and reports a rubric-based mean_acc via LLM-as-a-judge. I tested with gpt-5.1 on 10 samples (limit=10) and got mean_acc = 0.10 on the default subset.

Test config (TaskConfig):

task_cfg = TaskConfig(
    model='gpt-5.1',
    api_url=os.getenv('OPENAI_API_BASE'),
    api_key=os.getenv('OPENAI_API_KEY'),
    eval_type='openai_api',
    eval_batch_size=10,
    datasets=['cl_bench'],
    dataset_args={
        'cl_bench': {
            'dataset_id': 'tencent/CL-bench',
        }
    },
    generation_config={
        'max_tokens': 128000,
        'temperature': 1.0,
        'top_p': 1.0,
    },
    work_dir='cl_bench_test',
    judge_model_args={
        'api_key': os.getenv('OPENAI_API_KEY'),
        'api_url': os.getenv('OPENAI_API_BASE'),
        'model_id': 'gpt-5.1',
        'generation_config': {
            'max_tokens': 128000,
            'temperature': 1.0,
            'top_p': 1.0,
        }
    },
    analysis_report=True,
    limit=10,
)
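
For context, mean_acc is the mean of the per-sample accuracy values derived from the judge's 'Overall Score'. The sketch below shows one plausible aggregation; the normalization by a max score of 10 is an illustrative assumption, not necessarily the adapter's exact rule:

```python
# Illustrative sketch of the mean_acc aggregation: each sample gets an
# accuracy derived from the judge's 'Overall Score' (assumed here to be
# normalized by a max score of 10 -- an assumption for illustration),
# and mean_acc is their average.
def sample_acc(overall_score: float, max_score: float = 10.0) -> float:
    return max(0.0, min(1.0, overall_score / max_score))

def mean_acc(overall_scores: list[float]) -> float:
    if not overall_scores:
        return 0.0
    return sum(sample_acc(s) for s in overall_scores) / len(overall_scores)

# Ten judged samples averaging to 0.1, matching the 10-sample run below.
print(mean_acc([10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))  # 0.1
```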

Result summary:

+---------+-----------+----------+----------+-----+-------+---------+
| Model   | Dataset   | Metric   | Subset   | Num | Score | Cat.0   |
+---------+-----------+----------+----------+-----+-------+---------+
| gpt-5.1 | cl_bench  | mean_acc | default  | 10  | 0.1   | default |
+---------+-----------+----------+----------+-----+-------+---------+

@gemini-code-assist
Contributor

Summary of Changes

Hello @XChen-Zero, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for the CL-bench benchmark into the EvalScope framework. It establishes a new adapter to seamlessly integrate the tencent/CL-bench dataset from HuggingFace, process its data into OpenAI-style chat messages for model inference, and critically, implement an LLM-as-a-judge evaluation strategy. This judge uses a detailed, rubric-based grading prompt to assess model responses, providing a robust method for calculating and reporting mean_acc for instruction-following and reasoning tasks.

Highlights

  • CL-bench Benchmark Integration: Adds comprehensive support for the Tencent/CL-bench benchmark within the EvalScope framework, enabling evaluation of instruction-following and reasoning tasks.
  • LLM-as-a-Judge Evaluation: Implements a sophisticated LLM-as-a-judge mechanism for scoring, utilizing a detailed, rubric-based grading prompt to assess model responses and calculate mean_acc.
  • HuggingFace Dataset Loading: Enables direct loading of the tencent/CL-bench dataset from HuggingFace, streamlining data acquisition for evaluations.
  • OpenAI-style Message Processing: Adapts dataset messages to OpenAI-style chat messages, ensuring compatibility with various LLM APIs for inference.
  • Utility Functions for Rubric Processing: Introduces helper functions to format rubrics into human-readable text and to robustly extract JSON responses from the LLM judge's output.
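
To illustrate the message handling above: a record's OpenAI-style messages can be forwarded nearly verbatim to the inference API. The field names below ('messages', 'role', 'content', 'rubrics') are assumptions for illustration; the actual record_to_sample implementation may differ:

```python
# Illustrative sketch: turn a raw dataset record into OpenAI-style chat
# messages. Field names are assumed for illustration and may not match
# the actual CL-bench schema.
def to_chat_messages(record: dict) -> list[dict]:
    return [{'role': m['role'], 'content': m['content']}
            for m in record.get('messages', [])]

record = {
    'messages': [
        {'role': 'system', 'content': 'Follow every instruction in the context.'},
        {'role': 'user', 'content': 'Summarize the document in one sentence.'},
    ],
    'rubrics': ['Mentions the main topic.', 'Is a single sentence.'],
}
messages = to_chat_messages(record)
print(messages[1]['role'])  # user
```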


Changelog
  • evalscope/benchmarks/cl_bench/cl_bench_adapter.py
    • Added a new benchmark adapter for CL-bench, registering it with EvalScope.
    • Configured the adapter to use LLM-as-a-judge for evaluation by setting _use_llm_judge to True.
    • Implemented logic within the load method to handle dataset split loading, with a fallback mechanism for errors.
    • Defined the record_to_sample method to convert raw dataset records into Sample objects, extracting messages, rubrics, and metadata.
    • Implemented the llm_match_score method, which constructs a detailed grading prompt for the LLM judge, parses its JSON output, and calculates the accuracy score based on the judge's 'Overall Score'.
  • evalscope/benchmarks/cl_bench/utils.py
    • Added a new utility file containing helper functions specific to CL-bench.
    • Introduced build_rubrics_text to create a formatted string from a list of rubrics for use in LLM prompts.
    • Added extract_json_block to reliably extract JSON content from a string, including handling markdown code block wrappers.
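
As a rough sketch of what such a rubric formatter could look like (the actual build_rubrics_text in utils.py may number or word things differently):

```python
# Hypothetical sketch of a rubric formatter in the spirit of
# build_rubrics_text; the real implementation in
# evalscope/benchmarks/cl_bench/utils.py may format rubrics differently.
def build_rubrics_text(rubrics: list[str]) -> str:
    # Number each rubric on its own line for inclusion in the judge prompt.
    return '\n'.join(f'{i}. {rubric}' for i, rubric in enumerate(rubrics, start=1))

print(build_rubrics_text(['Answer cites the context.', 'Answer is under 100 words.']))
# 1. Answer cites the context.
# 2. Answer is under 100 words.
```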
Activity
  • No specific activity (comments, reviews, progress) was provided for this pull request.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the CL-bench benchmark, which is a valuable addition. The implementation is well-structured, including a new data adapter and utility functions. My review focuses on improving maintainability and robustness. I've identified an opportunity to simplify the load method in CLBenchAdapter, which is currently more complex than necessary and contains unreachable code. Additionally, I've suggested a more robust implementation for the extract_json_block utility function to better handle variations in LLM-generated output. These changes should make the new benchmark integration more reliable and easier to maintain.

Comment on lines +20 to +30
def extract_json_block(result_text: str) -> Optional[str]:
    # Remove code block wrapper if present
    if result_text.startswith('```json'):
        result_text = result_text[7:]
    if result_text.startswith('```'):
        result_text = result_text[3:]
    if result_text.endswith('```'):
        result_text = result_text[:-3]
    result_text = result_text.strip()

    return result_text

high

The current implementation of extract_json_block is not very robust. It uses a series of independent if statements rather than elif, which can lead to unexpected behavior, and it makes fragile assumptions about the input (e.g., no surrounding text, no whitespace after the ```json fence) that may not hold for all LLM outputs. A more robust approach using regular expressions, similar to the official CL-bench evaluation script, would handle variations in code block formatting and extract the JSON content more reliably.

Please also add import re at the top of the file to support this change.

Suggested change
def extract_json_block(result_text: str) -> Optional[str]:
    # Remove code block wrapper if present
    if result_text.startswith('```json'):
        result_text = result_text[7:]
    if result_text.startswith('```'):
        result_text = result_text[3:]
    if result_text.endswith('```'):
        result_text = result_text[:-3]
    result_text = result_text.strip()
    return result_text
def extract_json_block(result_text: str) -> Optional[str]:
    """Extracts a JSON block from a string."""
    text = result_text.strip()
    # First, attempt to find a JSON block enclosed in triple backticks
    match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', text)
    if match:
        return match.group(1).strip()
    # If no backticks, try to find content between the first '{' and the last '}'
    first_brace_index = text.find('{')
    last_brace_index = text.rfind('}')
    if first_brace_index != -1 and last_brace_index > first_brace_index:
        return text[first_brace_index:last_brace_index + 1]
    # As a last resort, return the stripped text, assuming it might be a raw JSON string.
    return text
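
As a quick sanity check of the regex-based approach, the extractor (reproduced below in a compact form) recovers valid JSON from fenced, prefixed, and bare judge outputs alike:

```python
import json
import re

# Compact reproduction of the suggested regex-based extractor, with a
# demonstration on three common judge-output shapes.
def extract_json_block(result_text: str) -> str:
    text = result_text.strip()
    # First, attempt to find a JSON block enclosed in triple backticks
    match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', text)
    if match:
        return match.group(1).strip()
    # If no backticks, try content between the first '{' and the last '}'
    first, last = text.find('{'), text.rfind('}')
    if first != -1 and last > first:
        return text[first:last + 1]
    return text  # last resort: assume the text is already raw JSON

fenced = '```json\n{"Overall Score": 3}\n```'
prefixed = 'Here are the grades:\n{"Overall Score": 3}\nThanks!'
bare = '{"Overall Score": 3}'
for raw in (fenced, prefixed, bare):
    print(json.loads(extract_json_block(raw)))  # {'Overall Score': 3} each time
```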

Comment on lines +43 to +62
def load(self):
    last_error = None
    original_split = self.eval_split
    candidate_splits = [original_split]  # only eval on the original split; kept as a list for possible future adaptation
    seen = set()
    for split in candidate_splits:
        if not split or split in seen:
            continue
        seen.add(split)
        try:
            self.eval_split = split
            return super().load()
        except Exception as exc:
            last_error = exc
            logger.warning(f'Failed to load CL-bench split "{split}": {exc}')
    if original_split:
        self.eval_split = original_split
    if last_error:
        raise last_error
    return super().load()

medium

The load method's implementation is overly complex for its current functionality. The candidate_splits list contains only one element, making the loop and the seen set redundant. Furthermore, the final return super().load() statement is unreachable under normal circumstances, which can be confusing. The logic can be greatly simplified to a try-except block, which would be more readable and maintainable.

Suggested change
def load(self):
    last_error = None
    original_split = self.eval_split
    candidate_splits = [original_split]  # only eval on the original split; kept as a list for possible future adaptation
    seen = set()
    for split in candidate_splits:
        if not split or split in seen:
            continue
        seen.add(split)
        try:
            self.eval_split = split
            return super().load()
        except Exception as exc:
            last_error = exc
            logger.warning(f'Failed to load CL-bench split "{split}": {exc}')
    if original_split:
        self.eval_split = original_split
    if last_error:
        raise last_error
    return super().load()
def load(self):
    try:
        return super().load()
    except Exception as exc:
        logger.warning(f'Failed to load CL-bench split "{self.eval_split}": {exc}')
        raise

@XChen-Zero
Contributor Author

More results

+---------+-----------+----------+----------+-------+---------+---------+
| Model   | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+=========+===========+==========+==========+=======+=========+=========+
| gpt-5.1 | cl_bench  | mean_acc | default  |   100 |    0.17 | default |
+---------+-----------+----------+----------+-------+---------+---------+

@Yunnglin
Collaborator

Yunnglin commented Feb 6, 2026

Thank you for your PR. Could you fix the lint issues? Run the following commands:

pip install pre-commit
pre-commit install
pre-commit run --all-files

Collaborator

@Yunnglin Yunnglin left a comment


I fixed the lint issues, removed the special handling of quotes, and added the relevant documentation. Thank you for your PR; it can be merged.

@Yunnglin Yunnglin merged commit 4212b2f into modelscope:main Feb 7, 2026
3 checks passed