
feat/Add CL-bench (tencent/CL-bench) benchmark #1191

Merged
Yunnglin merged 5 commits into modelscope:main from XChen-Zero:feat/add_clbench
Feb 7, 2026

Conversation

@XChen-Zero
Contributor

This PR adds CL-bench support to EvalScope: it registers a cl_bench benchmark, loads tencent/CL-bench from HuggingFace, runs inference on the dataset's OpenAI-style messages, and reports a rubric-based mean_acc via LLM-as-a-judge. I tested with gpt-5.1 on 10 samples (limit=10) and got mean_acc = 0.10 on the default subset.

Test config (TaskConfig):

task_cfg = TaskConfig(
    model='gpt-5.1',
    api_url=os.getenv('OPENAI_API_BASE'),
    api_key=os.getenv('OPENAI_API_KEY'),
    eval_type='openai_api',
    eval_batch_size=10,
    datasets=['cl_bench'],
    dataset_args={
        'cl_bench': {
            'dataset_id': 'tencent/CL-bench',
        }
    },
    generation_config={
        'max_tokens': 128000,
        'temperature': 1.0,
        'top_p': 1.0,
    },
    work_dir='cl_bench_test',
    judge_model_args={
        'api_key': os.getenv('OPENAI_API_KEY'),
        'api_url': os.getenv('OPENAI_API_BASE'),
        'model_id': 'gpt-5.1',
        'generation_config': {
            'max_tokens': 128000,
            'temperature': 1.0,
            'top_p': 1.0,
        }
    },
    analysis_report=True,
    limit=10,
)
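
For context, mean_acc is the mean of the per-sample accuracy values derived from the judge's 'Overall Score'. The sketch below shows one plausible aggregation; the normalization by a max score of 10 is an illustrative assumption, not necessarily the adapter's exact rule:

```python
# Illustrative sketch of the mean_acc aggregation: each sample gets an
# accuracy derived from the judge's 'Overall Score' (assumed here to be
# normalized by a max score of 10 -- an assumption for illustration),
# and mean_acc is their average.
def sample_acc(overall_score: float, max_score: float = 10.0) -> float:
    return max(0.0, min(1.0, overall_score / max_score))

def mean_acc(overall_scores: list[float]) -> float:
    if not overall_scores:
        return 0.0
    return sum(sample_acc(s) for s in overall_scores) / len(overall_scores)

# Ten judged samples averaging to 0.1, matching the 10-sample run below.
print(mean_acc([10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))  # 0.1
```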

Result summary:

+---------+-----------+----------+----------+-----+-------+---------+
| Model   | Dataset   | Metric   | Subset   | Num | Score | Cat.0   |
+---------+-----------+----------+----------+-----+-------+---------+
| gpt-5.1 | cl_bench  | mean_acc | default  | 10  | 0.1   | default |
+---------+-----------+----------+----------+-----+-------+---------+

@gemini-code-assist
Contributor

Summary of Changes

Hello @XChen-Zero, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for the CL-bench benchmark into the EvalScope framework. It establishes a new adapter to seamlessly integrate the tencent/CL-bench dataset from HuggingFace, process its data into OpenAI-style chat messages for model inference, and critically, implement an LLM-as-a-judge evaluation strategy. This judge uses a detailed, rubric-based grading prompt to assess model responses, providing a robust method for calculating and reporting mean_acc for instruction-following and reasoning tasks.

Highlights

  • CL-bench Benchmark Integration: Adds comprehensive support for the Tencent/CL-bench benchmark within the EvalScope framework, enabling evaluation of instruction-following and reasoning tasks.
  • LLM-as-a-Judge Evaluation: Implements a sophisticated LLM-as-a-judge mechanism for scoring, utilizing a detailed, rubric-based grading prompt to assess model responses and calculate mean_acc.
  • HuggingFace Dataset Loading: Enables direct loading of the tencent/CL-bench dataset from HuggingFace, streamlining data acquisition for evaluations.
  • OpenAI-style Message Processing: Adapts dataset messages to OpenAI-style chat messages, ensuring compatibility with various LLM APIs for inference.
  • Utility Functions for Rubric Processing: Introduces helper functions to format rubrics into human-readable text and to robustly extract JSON responses from the LLM judge's output.
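
To illustrate the message handling above: a record's OpenAI-style messages can be forwarded nearly verbatim to the inference API. The field names below ('messages', 'role', 'content', 'rubrics') are assumptions for illustration; the actual record_to_sample implementation may differ:

```python
# Illustrative sketch: turn a raw dataset record into OpenAI-style chat
# messages. Field names are assumed for illustration and may not match
# the actual CL-bench schema.
def to_chat_messages(record: dict) -> list[dict]:
    return [{'role': m['role'], 'content': m['content']}
            for m in record.get('messages', [])]

record = {
    'messages': [
        {'role': 'system', 'content': 'Follow every instruction in the context.'},
        {'role': 'user', 'content': 'Summarize the document in one sentence.'},
    ],
    'rubrics': ['Mentions the main topic.', 'Is a single sentence.'],
}
messages = to_chat_messages(record)
print(messages[1]['role'])  # user
```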


Changelog
  • evalscope/benchmarks/cl_bench/cl_bench_adapter.py
    • Added a new benchmark adapter for CL-bench, registering it with EvalScope.
    • Configured the adapter to use LLM-as-a-judge for evaluation by setting _use_llm_judge to True.
    • Implemented logic within the load method to handle dataset split loading, with a fallback mechanism for errors.
    • Defined the record_to_sample method to convert raw dataset records into Sample objects, extracting messages, rubrics, and metadata.
    • Implemented the llm_match_score method, which constructs a detailed grading prompt for the LLM judge, parses its JSON output, and calculates the accuracy score based on the judge's 'Overall Score'.
  • evalscope/benchmarks/cl_bench/utils.py
    • Added a new utility file containing helper functions specific to CL-bench.
    • Introduced build_rubrics_text to create a formatted string from a list of rubrics for use in LLM prompts.
    • Added extract_json_block to reliably extract JSON content from a string, including handling markdown code block wrappers.
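
As a rough sketch of what such a rubric formatter could look like (the actual build_rubrics_text in utils.py may number or word things differently):

```python
# Hypothetical sketch of a rubric formatter in the spirit of
# build_rubrics_text; the real implementation in
# evalscope/benchmarks/cl_bench/utils.py may format rubrics differently.
def build_rubrics_text(rubrics: list[str]) -> str:
    # Number each rubric on its own line for inclusion in the judge prompt.
    return '\n'.join(f'{i}. {rubric}' for i, rubric in enumerate(rubrics, start=1))

print(build_rubrics_text(['Answer cites the context.', 'Answer is under 100 words.']))
# 1. Answer cites the context.
# 2. Answer is under 100 words.
```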
Activity
  • No specific activity (comments, reviews, progress) was provided for this pull request.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the CL-bench benchmark, which is a valuable addition. The implementation is well-structured, including a new data adapter and utility functions. My review focuses on improving maintainability and robustness. I've identified an opportunity to simplify the load method in CLBenchAdapter, which is currently more complex than necessary and contains unreachable code. Additionally, I've suggested a more robust implementation for the extract_json_block utility function to better handle variations in LLM-generated output. These changes should make the new benchmark integration more reliable and easier to maintain.

Comment on lines +20 to +30
def extract_json_block(result_text: str) -> Optional[str]:
    # Remove code block wrapper if present
    if result_text.startswith('```json'):
        result_text = result_text[7:]
    if result_text.startswith('```'):
        result_text = result_text[3:]
    if result_text.endswith('```'):
        result_text = result_text[:-3]
    result_text = result_text.strip()

    return result_text

high

The current implementation of extract_json_block is not very robust. It uses a series of independent if statements rather than elif, which can lead to unexpected behavior, and it makes fragile assumptions about the input (e.g., no surrounding text, no whitespace after the ```json fence) that may not hold for all LLM outputs. A more robust approach using regular expressions, similar to the official CL-bench evaluation script, would handle variations in code block formatting and extract the JSON content more reliably.

Please also add import re at the top of the file to support this change.

Suggested change
def extract_json_block(result_text: str) -> Optional[str]:
    # Remove code block wrapper if present
    if result_text.startswith('```json'):
        result_text = result_text[7:]
    if result_text.startswith('```'):
        result_text = result_text[3:]
    if result_text.endswith('```'):
        result_text = result_text[:-3]
    result_text = result_text.strip()
    return result_text
def extract_json_block(result_text: str) -> Optional[str]:
    """Extracts a JSON block from a string."""
    text = result_text.strip()
    # First, attempt to find a JSON block enclosed in triple backticks
    match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', text)
    if match:
        return match.group(1).strip()
    # If no backticks, try to find content between the first '{' and the last '}'
    first_brace_index = text.find('{')
    last_brace_index = text.rfind('}')
    if first_brace_index != -1 and last_brace_index > first_brace_index:
        return text[first_brace_index:last_brace_index + 1]
    # As a last resort, return the stripped text, assuming it might be a raw JSON string.
    return text
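
As a quick sanity check of the regex-based approach, the extractor (reproduced below in a compact form) recovers valid JSON from fenced, prefixed, and bare judge outputs alike:

```python
import json
import re

# Compact reproduction of the suggested regex-based extractor, with a
# demonstration on three common judge-output shapes.
def extract_json_block(result_text: str) -> str:
    text = result_text.strip()
    # First, attempt to find a JSON block enclosed in triple backticks
    match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', text)
    if match:
        return match.group(1).strip()
    # If no backticks, try content between the first '{' and the last '}'
    first, last = text.find('{'), text.rfind('}')
    if first != -1 and last > first:
        return text[first:last + 1]
    return text  # last resort: assume the text is already raw JSON

fenced = '```json\n{"Overall Score": 3}\n```'
prefixed = 'Here are the grades:\n{"Overall Score": 3}\nThanks!'
bare = '{"Overall Score": 3}'
for raw in (fenced, prefixed, bare):
    print(json.loads(extract_json_block(raw)))  # {'Overall Score': 3} each time
```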

Comment on lines +43 to +62
def load(self):
    last_error = None
    original_split = self.eval_split
    candidate_splits = [original_split]  # only eval on the original split; kept as a list for possible future adaptation
    seen = set()
    for split in candidate_splits:
        if not split or split in seen:
            continue
        seen.add(split)
        try:
            self.eval_split = split
            return super().load()
        except Exception as exc:
            last_error = exc
            logger.warning(f'Failed to load CL-bench split "{split}": {exc}')
    if original_split:
        self.eval_split = original_split
    if last_error:
        raise last_error
    return super().load()

medium

The load method's implementation is overly complex for its current functionality. The candidate_splits list contains only one element, making the loop and the seen set redundant. Furthermore, the final return super().load() statement is unreachable under normal circumstances, which can be confusing. The logic can be greatly simplified to a try-except block, which would be more readable and maintainable.

Suggested change
def load(self):
    last_error = None
    original_split = self.eval_split
    candidate_splits = [original_split]  # only eval on the original split; kept as a list for possible future adaptation
    seen = set()
    for split in candidate_splits:
        if not split or split in seen:
            continue
        seen.add(split)
        try:
            self.eval_split = split
            return super().load()
        except Exception as exc:
            last_error = exc
            logger.warning(f'Failed to load CL-bench split "{split}": {exc}')
    if original_split:
        self.eval_split = original_split
    if last_error:
        raise last_error
    return super().load()
def load(self):
    try:
        return super().load()
    except Exception as exc:
        logger.warning(f'Failed to load CL-bench split "{self.eval_split}": {exc}')
        raise

@XChen-Zero
Contributor Author

More results

+---------+-----------+----------+----------+-------+---------+---------+
| Model   | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+=========+===========+==========+==========+=======+=========+=========+
| gpt-5.1 | cl_bench  | mean_acc | default  |   100 |    0.17 | default |
+---------+-----------+----------+----------+-------+---------+---------+

@Yunnglin
Collaborator

Yunnglin commented Feb 6, 2026

Thank you for your PR. Could you fix the lint issues? Run the following commands:

pip install pre-commit
pre-commit install
pre-commit run --all-files

Collaborator

@Yunnglin Yunnglin left a comment


I fixed the lint issues, removed the special handling of quotes, and added the relevant documentation. Thank you for your PR; it can be merged.

@Yunnglin Yunnglin merged commit 4212b2f into modelscope:main Feb 7, 2026
3 checks passed