
Conversation

@yanqingliu35-stack (Contributor) commented Oct 27, 2025

This PR introduces native support for evaluating RWKV models with the lm_eval harness.

How to use:

python run_lm_eval.py /path/to/model

@yanqingliu35-stack yanqingliu35-stack requested a review from a team as a code owner October 27, 2025 02:52
@gemini-code-assist (bot) commented:

Summary of Changes

Hello @yanqingliu35-stack, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces native support for evaluating RWKV language models within a standardized framework. By adding a new script and a custom adapter, it enables seamless integration with the lm_eval harness, allowing for comprehensive benchmarking of RWKV models across various NLP tasks and ensuring that evaluation results are consistently recorded.

Highlights

  • New RWKV Evaluation Script: A new Python script, run_lm_eval.py, has been added to facilitate the evaluation of RWKV models.
  • LM Evaluation Harness Integration: The script integrates RWKV models with the lm_eval evaluation harness, providing a standardized framework for benchmarking.
  • Custom Adapter Implementation: An EvalHarnessAdapter class is implemented to bridge RWKV model functionalities, such as tokenization, log-likelihood calculation, and text generation, with the lm_eval API (a minimal sketch of this integration follows this list).
  • Result Persistence: Evaluation results are automatically saved to a JSON file, ensuring easy access and analysis of model performance across tasks.
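To make the integration above concrete, here is a minimal sketch of how an adapter of this kind typically plugs into the lm_eval harness. The class name, stub return values, task names, and output file below are illustrative assumptions, not the PR's actual code.

# Minimal illustrative sketch (not the PR's code): an adapter subclassing
# lm_eval's abstract LM class, then passed to lm_eval.simple_evaluate.
import json
import lm_eval
from lm_eval.api.model import LM

class RWKVAdapterSketch(LM):
    # A real adapter would wrap the RWKV model and tokenizer and compute these
    # quantities from forward passes; the stubs below only show the interface.
    def loglikelihood(self, requests):
        return [(0.0, False) for _ in requests]   # (log-prob, is_greedy) per request

    def loglikelihood_rolling(self, requests):
        return [0.0 for _ in requests]

    def generate_until(self, requests):
        return ["" for _ in requests]             # generated text per request

results = lm_eval.simple_evaluate(
    model=RWKVAdapterSketch(),
    tasks=["lambada_openai", "piqa"],             # example task names
    num_fewshot=0,
)

# Persist the per-task results to JSON (hypothetical file name).
with open("eval_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)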

@gemini-code-assist (gemini-code-assist bot) left a comment


Code Review

This pull request adds a new script for evaluating RWKV models using lm-eval. The script provides a good starting point by wrapping the RWKV model in an adapter compatible with the evaluation harness. My review includes several suggestions to improve the script's correctness, robustness, and maintainability. The most critical issue is that the script, as written, will not run any evaluation tasks because the task list is empty. I've also pointed out a bug where a global model variable is used instead of the instance's self.model, and suggested improvements like using dependency injection and enhancing command-line argument parsing for better flexibility.

Comment on lines 37 to 40
eval_tasks = [
#'lambada_openai', 'piqa', 'storycloze_2016', 'hellaswag', 'winogrande',
#'arc_challenge', 'arc_easy', 'headqa_en', 'openbookqa', 'sciq',
#'mmlu','glue']

Severity: high

The eval_tasks list is currently empty because all task names are commented out. This will cause the script to run without performing any evaluations. To make the script functional, you should uncomment the tasks you intend to run.

Additionally, there's a typo on line 40: a full-width comma (，) is used instead of a standard comma (,), which would cause a SyntaxError if that line is uncommented. I've corrected it in the suggestion.

eval_tasks = [
    'lambada_openai', 'piqa', 'storycloze_2016', 'hellaswag', 'winogrande',
    # 'arc_challenge', 'arc_easy', 'headqa_en', 'openbookqa', 'sciq',
    # 'mmlu', 'glue',
]

all_tokens = []
state = None

out, state = model.forward(context_tokens, state)

Severity: high

This line uses the global model variable. For consistency and to avoid potential bugs, it should use self.model like in other methods of this class.

            out, state = self.model.forward(context_tokens, state)

break

all_tokens.append(token)
out, state = model.forward([token], state)

Severity: high

This line uses the global model variable. For consistency and to avoid potential bugs, it should use self.model like in other methods of this class.

                out, state = self.model.forward([token], state)

########################################################################################################
# pip install rwkv lm_eval --upgrade

import os, sys, types, json, math, time

Severity: medium

According to the PEP 8 style guide, it's recommended to have one import per line. This improves code readability and maintainability.

import os
import sys
import types
import json
import math
import time

Comment on lines 27 to 41
if len(sys.argv) < 2:
print("Usage: python your_script_name.py /path/to/your/model.pth")
sys.exit(1)

MODEL_NAME = sys.argv[1]

print(f'Loading model - {MODEL_NAME}')
model = RWKV(model=MODEL_NAME, strategy='cuda fp16')
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")

eval_tasks = [
#'lambada_openai', 'piqa', 'storycloze_2016', 'hellaswag', 'winogrande',
#'arc_challenge', 'arc_easy', 'headqa_en', 'openbookqa', 'sciq',
#'mmlu','glue']
num_fewshot = 0

Severity: medium

The script currently uses basic command-line argument parsing for the model path only, and hardcodes many important parameters like strategy, vocab, eval_tasks, and num_fewshot. This makes the script less flexible. Consider using Python's argparse module to allow users to configure these settings from the command line. This would make the script more powerful and easier to use for different evaluation scenarios.
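
As a rough sketch of that suggestion, an argparse-based entry point could look like the following; the flag names and defaults are illustrative assumptions, not prescriptions.

import argparse

def parse_args():
    # Hypothetical CLI surface; flag names and defaults are only illustrative.
    parser = argparse.ArgumentParser(description="Evaluate an RWKV model with lm_eval.")
    parser.add_argument("model_path", help="Path to the RWKV checkpoint (.pth)")
    parser.add_argument("--strategy", default="cuda fp16", help="RWKV loading strategy")
    parser.add_argument("--vocab", default="rwkv_vocab_v20230424", help="Vocabulary name for PIPELINE")
    parser.add_argument("--tasks", nargs="+", default=["lambada_openai"], help="lm_eval task names")
    parser.add_argument("--num-fewshot", type=int, default=0, help="Number of few-shot examples")
    parser.add_argument("--output", default="eval_results.json", help="Where to write the results JSON")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # The script would then build the model and pipeline from these arguments, e.g.:
    # model = RWKV(model=args.model_path, strategy=args.strategy)
    # pipeline = PIPELINE(model, args.vocab)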

Comment on lines +51 to +54
    def __init__(self):
        super().__init__()
        self.tokenizer = pipeline.tokenizer
        self.model = model

Severity: medium

The EvalHarnessAdapter's __init__ method relies on global variables pipeline and model. It's a better practice to pass these dependencies explicitly to the constructor (dependency injection). This improves modularity, reusability, and makes the class easier to test. After this change, you'll also need to update how the adapter is instantiated on line 160 to adapter = EvalHarnessAdapter(model, pipeline.tokenizer).

    def __init__(self, model, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        self.model = model
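
A brief usage sketch of the injected form, using the variable names from the review comment above (illustrative, not the PR's exact code):

# Production path: inject the model and pipeline built earlier in the script.
adapter = EvalHarnessAdapter(model, pipeline.tokenizer)

# Test path (purely illustrative): the same class now accepts lightweight fakes,
# so it can be exercised without loading a full RWKV checkpoint.
# adapter = EvalHarnessAdapter(fake_model, fake_tokenizer)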
