refactor: split BRIGHT benchmark into individual subset tasks
#3285
base: main
Conversation
KennethEnevoldsen left a comment:
Hmm, this change will invalidate all previous results on BRIGHT.
You know that you can also simply subselect from a task using:
task = mteb.get_task("BrightRetrieval", eval_splits=..., hf_subsets=...)
For the leaderboard display it is even possible to create custom summary tables (see e.g. #3272)
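For context, such a subselection would look roughly like the sketch below; the split and subset names are assumptions about how the combined `BrightRetrieval` task is organized, not verified values.

```python
import mteb

# Subselect a single split and domain from the combined BRIGHT task
# ("standard" and "biology" are assumed names, used for illustration only)
task = mteb.get_task(
    "BrightRetrieval",
    eval_splits=["standard"],
    hf_subsets=["biology"],
)
evaluation = mteb.MTEB(tasks=[task])
```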
Yes, but …
Ohh... yeah, that is hard to fix. I see that the original BRIGHT (long) only has four models and BRIGHT only has 12, so I guess it is possible to rerun them.
If the scores change, are the new scores more similar to or more different from the official scores? If closer, then I think it is fine, and maybe we can rerun some models. For many models on our BRIGHT leaderboard I just converted the scores from https://brightbenchmark.github.io/ to MTEB format when we originally added them, so they may still be fine if these changes actually bring our implementation closer to that one.
Would it be enough to evaluate the performance of ReasonIR, or is there a list of other models that would be good enough to test?
To check the implementation this will be enough; just don't update the old leaderboard.
After the split:

```python
import torch
import mteb

# https://github.com/facebookresearch/ReasonIR/tree/main/evaluation/bright/configs/reasonir
prompts_dict = {
    "BrightBiologyRetrieval": "Given a Biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a Earth Science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a Economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a Psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a Robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a Stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a Sustainable Living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a Pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}

tasks = mteb.get_tasks(tasks=list(prompts_dict.keys()), languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)
evaluation.run(
    model,
    save_predictions=True,
    output_folder="evaluation/results",
    encode_kwargs={"batch_size": 1},
)
```

The results are as follows:
Great results! But I'm a bit unsure whether the prompts are applied correctly when they're passed through:

mteb/mteb/models/instruct_wrapper.py, lines 158 to 171 in d2c704c
After adding code to print the instruction, the following output was produced:
Interesting, thanks! I didn’t think that would work since it’s a bit unintended, but maybe we should update the code to handle this case. I've checked the code for ReasonIR and found some other places that can help with reproduction:
@Muennighoff Can you help with what we can do to reproduce the results?
I think the ID filtering is probably the main missing piece to fully reproduce the results?
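For reference, a rough sketch of what such a filtering step could look like, assuming each BRIGHT example carries an `excluded_ids` list (the field name and the `{query_id: {doc_id: score}}` result shape are assumptions):

```python
def drop_excluded_ids(results: dict, examples) -> dict:
    """Remove each query's excluded documents from retrieval results before scoring."""
    # examples: rows of the BRIGHT "examples" config with an `excluded_ids` list per query
    excluded = {e["id"]: set(e.get("excluded_ids", [])) for e in examples}
    return {
        qid: {
            doc_id: score
            for doc_id, score in doc_scores.items()
            if doc_id not in excluded.get(qid, set())
        }
        for qid, doc_scores in results.items()
    }
```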
I think points 1 and 2 are a separate issue, as they are related to query expansion. The problem of the performance not being reproducible in the single …
```
# Conflicts:
#   mteb/benchmarks/benchmarks/__init__.py
#   mteb/tasks/Retrieval/__init__.py
#   mteb/tasks/retrieval/eng/BrightSubsetsLongRetrieval.py
#   mteb/tasks/retrieval/eng/BrightSubsetsRetrieval.py
```
I think it would be better to close this PR and work on it later together with "Excluded IDs missing from BRIGHT dataset" (#2696). Also, we should revise it to fit the v2 format and include descriptive stats as well. What do you think?
Do you mean that you don't want the tasks in this PR and will add another PR for #2696?
Yes, you need to add the statistics to merge. To apply …
What tasks need to be redone for this PR? I'm confused about the changes with the v2 format, so I would appreciate your help.
I think we can solve #2696 in this PR, because otherwise we would need to create v2 versions of these tasks, which I think is not a good solution.
```python
domain_corpus_long = datasets.load_dataset(
    path,
    "long_documents",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
examples = datasets.load_dataset(
    path,
    "examples",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
corpus["long"] = {e["id"]: {"text": e["content"]} for e in domain_corpus_long}
queries["long"] = {e["id"]: e["query"] for e in examples}
relevant_docs["long"] = defaultdict(dict)
```
To follow the v2 format, you can remove the conversion of the dataset to a dict and pass the dataset directly:

```python
domain_corpus_long = domain_corpus_long.rename_column("content", "text")
queries = queries.rename_column("query", "text")
...
return domain_corpus_long, queries, relevant_docs
```
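For illustration, a fuller sketch of such a loader, reusing the arguments from the snippet above; the `gold_ids_long` column name and the qrels shape are assumptions that should be checked against the BRIGHT dataset:

```python
import datasets


def load_bright_long_data(path, domain, cache_dir=None, revision=None):
    # Load the long-document corpus and the examples (queries) for one BRIGHT domain
    corpus = datasets.load_dataset(
        path, "long_documents", split=domain, cache_dir=cache_dir, revision=revision
    )
    examples = datasets.load_dataset(
        path, "examples", split=domain, cache_dir=cache_dir, revision=revision
    )

    # v2 style: keep HF datasets and rename the text columns instead of building dicts
    corpus = corpus.rename_column("content", "text")
    queries = examples.rename_column("query", "text")

    # Qrels as {query-id: {corpus-id: score}}, built from the gold long-document IDs
    relevant_docs = {
        e["id"]: {doc_id: 1 for doc_id in e["gold_ids_long"]} for e in examples
    }
    return corpus, queries, relevant_docs
```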
```python
if self.data_loaded:
    return

self.corpus, self.queries, self.relevant_docs = load_bright_long_data(
```
And then here it should look like:

```python
self.dataset["default"]["long"]["corpus"], self.dataset["default"]["long"]["queries"], self.dataset["default"]["long"]["relevant_documents"]
```
You can refer to mteb/mteb/abstasks/retrieval_dataset_loaders.py, lines 25 to 38 in 0ead029:
```python
class RetrievalSplitData(TypedDict):
    """A dictionary containing the corpus, queries, relevant documents, instructions, and top-ranked documents for a retrieval task.

    Attributes:
        corpus: The corpus dataset containing documents. Should have columns `id`, `title`, `text` or `image`.
        queries: The queries dataset containing queries. Should have columns `id`, `text`, `instruction` (for instruction retrieval/reranking) or `image`.
        relevant_docs: A mapping of query IDs to relevant document IDs and their relevance scores. Should have columns `query-id`, `corpus-id`, `score`.
        top_ranked: A mapping of query IDs to a list of top-ranked document IDs. Should have columns `query-id`, `corpus-ids` (list[str]). This is optional and used for reranking tasks.
    """

    corpus: CorpusDatasetType
    queries: QueryDatasetType
    relevant_docs: RelevantDocumentsType
    top_ranked: TopRankedDocumentsType | None
```
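Putting the pieces together, the task's `load_data` could then assemble that structure roughly as below; this is a sketch only, `self.domain` is a hypothetical attribute holding the subset name, and the keys follow `RetrievalSplitData`:

```python
def load_data(self, **kwargs) -> None:
    if self.data_loaded:
        return
    corpus, queries, relevant_docs = load_bright_long_data(
        self.metadata.dataset["path"],
        self.domain,  # hypothetical: the BRIGHT domain this task covers
        revision=self.metadata.dataset["revision"],
    )
    self.dataset = {
        "default": {
            "long": {
                "corpus": corpus,
                "queries": queries,
                "relevant_docs": relevant_docs,
                "top_ranked": None,
            }
        }
    }
    self.data_loaded = True
```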
Great! So for now the most different task is pony?
Among the tasks with excluded_ids, pony seems to be the most different. The other tasks seem to have reproduced the performance reported in the paper to some extent.
I think the main difference is because of how you've evaluated …
Scores are looking really close, great work. Are you asking me whether in the paper they were evaluated with shots or without?
Yes.
Yeah, I think those specific paper results are zero-shot.
I set …
Yeah, that seems right to me (cc'ing @RulinShao in case she has thoughts on whether we're missing something for full reproduction, or whether the scores seem close enough).
I made a mistake by omitting a newline (`\n`). Also, I would like to ask if it would be better to handle this modification as a new PR.

mteb/mteb/models/model_implementations/reasonir_model.py, lines 16 to 25 in 976fadf
From https://github.com/facebookresearch/ReasonIR/blob/main/evaluation/bright/configs/reasonir/biology.json:

```json
{
    "instructions": {
        "query": "<|user|>\nGiven a {task} post, retrieve relevant passages that help answer the post\n<|embed|>\n",
        "document": "<|embed|>\n"
    },
    "instructions_long": {
        "query": "<|user|>\nGiven a {task} post, retrieve relevant documents that help answer the post\n<|embed|>\n",
        "document": "<|embed|>\n"
    }
}
```
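To make the missing newline concrete, this is roughly how the query instruction from that config would be rendered before being combined with the query text; the example query is purely illustrative:

```python
task = "biology"
instruction = (
    f"<|user|>\nGiven a {task} post, retrieve relevant passages "
    f"that help answer the post\n<|embed|>\n"
)
query = "Why do apples turn brown after being cut?"  # illustrative query
print(instruction + query)
```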
I think it would be better to make the fix in a separate PR.
When I look at the original repository, it seems like the …

The performance was measured based on the following code:

```python
import torch
import mteb
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

prompts_dict = {
    "BrightBiologyRetrieval": "Given a biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a earth_science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a sustainable_living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}

model_path = "ReasonIR/ReasonIR-8B"
model_name = model_path.split("/")[-1]
model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    max_seq_length=32768,
    prompts_dict=prompts_dict,
)

cache_dir = "evaluation/cache/bright"
for task_name in prompts_dict.keys():
    print(f"task: {task_name}")
    tasks = mteb.get_tasks(tasks=[task_name], languages=["eng"])
    cache = mteb.cache.ResultCache(cache_dir)
    try:
        mteb.evaluate(
            model,
            tasks,
            cache=cache,
            overwrite_strategy="only-missing",
            prediction_folder=f"{cache_dir}/predictions/{model_name.replace('/', '__')}",
            encode_kwargs={"batch_size": 1},
        )
        print(f"{task_name} completed successfully")
        torch.cuda.empty_cache()
    except torch.cuda.OutOfMemoryError:
        print(f"{task_name} skipped due to OOM error")
        torch.cuda.empty_cache()
        continue
```

The performance differences are as follows:
I'm not sure if the problem is with …
Interestingly, the score on most tasks dropped. On …

@whybe-choi You can try to reproduce the scores for …

UPD: In the BRIGHT paper, I think they reported the short version in the main table, because they have the long version in …
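For reference, the long-document variants added in this PR can be run as separate tasks; the task name below is taken from the new descriptive-statistics files:

```python
import mteb

# Long-document variant of the biology subset introduced in this PR
tasks = mteb.get_tasks(tasks=["BrightBiologyLongRetrieval"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
```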
To reproduce the results for bge:

```python
import torch
import mteb
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

prompts_dict = {
    "BrightBiologyRetrieval-query": "Represent this biology post for searching relevant passages: ",
    "BrightEarthScienceRetrieval-query": "Represent this earth_science post for searching relevant passages: ",
    "BrightEconomicsRetrieval-query": "Represent this economics post for searching relevant passages: ",
    "BrightPsychologyRetrieval-query": "Represent this psychology post for searching relevant passages: ",
    "BrightRoboticsRetrieval-query": "Represent this robotics post for searching relevant passages: ",
    "BrightStackoverflowRetrieval-query": "Represent this stackoverflow post for searching relevant passages: ",
    "BrightSustainableLivingRetrieval-query": "Represent this sustainable_living post for searching relevant passages: ",
    "BrightPonyRetrieval-query": "Represent this Pony question for searching relevant passages: ",
    "BrightLeetcodeRetrieval-query": "Represent this Coding problem for searching relevant examples: ",
    "BrightAopsRetrieval-query": "Represent this Math problem for searching relevant examples: ",
    "BrightTheoremQATheoremsRetrieval-query": "Represent this Math problem for searching relevant theorems: ",
    "BrightTheoremQAQuestionsRetrieval-query": "Represent this Math problem for searching relevant examples: ",
}

model_path = "BAAI/bge-large-en-v1.5"
model_name = model_path.split("/")[-1]
model = mteb.get_model(
    model_path,
    model_kwargs={"torch_dtype": torch.float32},
    tokenizer_kwargs={"max_seq_length": 512},
    model_prompts=prompts_dict,
)

cache_dir = "evaluation/cache/bright_v2"
for task_name in prompts_dict.keys():
    task_name = task_name.split("-")[0]
    print(f"task: {task_name}")
    tasks = mteb.get_tasks(tasks=[task_name], languages=["eng"])
    cache = mteb.cache.ResultCache(cache_dir)
    try:
        mteb.evaluate(
            model,
            tasks,
            cache=cache,
            overwrite_strategy="only-missing",
            prediction_folder=f"{cache_dir}/predictions/{model_name.replace('/', '__')}",
            encode_kwargs={"batch_size": 1},
        )
        print(f"✅ {task_name} completed successfully")
        torch.cuda.empty_cache()
    except torch.cuda.OutOfMemoryError:
        print(f"⚠️ {task_name} skipped due to OOM error")
        torch.cuda.empty_cache()
        continue
```

The results are as follows:
@whybe-choi You didn't add a code …
I will try to rerun bge from the BRIGHT repo.
I ran bge on earth, biology and pony and got the same results as in the paper.
Did I miss anything when evaluating the bge model using mteb?
I don't know for now; I will try to dig deeper over the weekend.
I tried to debug it, but I'm getting slightly different embeddings for the texts. For example, with mteb I'm getting …, but in the BRIGHT repo I get …. I even added … to …, which leads to a difference in the resulting similarities, e.g. for `["0"]["Pony/src-math-is_prime-_0.txt"]` …. So, I think the scores are as close as possible to integrate. I don't know how to reproduce this fully.
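For anyone else digging into this, a minimal comparison harness for embeddings dumped from the two pipelines; the file names are hypothetical:

```python
import numpy as np

# Hypothetical dumps of the same texts' embeddings from the mteb and BRIGHT pipelines
emb_mteb = np.load("emb_mteb.npy")
emb_bright = np.load("emb_bright.npy")

# Largest element-wise deviation between the two runs
print("max abs diff:", np.abs(emb_mteb - emb_bright).max())

# Row-wise cosine similarity, assuming both sets of embeddings are L2-normalized
print("row-wise cosine:", (emb_mteb * emb_bright).sum(axis=1)[:5])
```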
I'm sorry to hear that. Is there any additional work I need to do on this PR?
I think you can add prompts to the tasks and write a better description for them.
I think it’s tricky to add a prompt because the format of the prompt varies for each model. For example, each model uses the following prompt for …
I think you can add a prompt like the one for bge.
Close #3268
This pull request adds new BRIGHT subset benchmarks and their corresponding descriptive statistics to the retrieval benchmark suite. These changes enable more granular, domain-specific evaluation for reasoning-intensive retrieval tasks, both for standard and long document formats.
Benchmark additions
- Added two new benchmarks, `BRIGHT_SUBSETS` and `BRIGHT_SUBSETS_LONG`, to the `mteb/benchmarks/benchmarks/benchmarks.py` file, covering individual domains of the BRIGHT benchmark for both standard and long document retrieval tasks. [1] [2]
- Registered the new benchmarks in the `mteb/benchmarks/benchmarks/__init__.py` file for import and usage. [1] [2]

Descriptive statistics

- Added descriptive statistics files for each subset task (`BrightBiologyRetrieval.json`, `BrightBiologyLongRetrieval.json`, etc.), detailing sample counts, text lengths, and relevant document statistics for each domain. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]

Minor improvement

- Cleaned up the `BEIR_NL` benchmark description for improved readability.