Confusing embed_dim across different implementations #1495

@neilwen987

Description

Hi, I want to post-train a model and test it on the MTEB benchmark.
During implementation, I found that loading the pre-trained model in different ways results in different embed_dim.
For example:
model = SentenceTransformer("dunzhang/stella_en_1.5B_v5") produces an embedding output dim = 1024, exactly as described at https://huggingface.co/dunzhang/stella_en_1.5B_v5,
while model = mteb.get_model('dunzhang/stella_en_1.5B_v5') produces an embedding output dim = 1536.
This difference in embedding dim is confusing, and the test results also differ because of it.

Here's the code I use:

import mteb
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
from typing import Dict, List, Optional, Union
from mteb.encoder_interface import PromptType
from mteb.models.wrapper import Wrapper
import numpy as np

class my_model(Wrapper):
    def __init__(self, model_name):
        super().__init__()
        # self.model = mteb.get_model(model_name)
        self.model = SentenceTransformer(model_name)

    def encode(
        self,
        sentences: list[str],
        task_name: str,
        prompt_type: PromptType | None = None,
        **kwargs,
    ) -> np.ndarray:
        embeddings = self.model.encode(sentences, task_name=task_name)
        # print(embeddings.shape)
        return embeddings

tasks_name: Dict[str, list] = {
    'classification': [
        "Banking77Classification",
    ],
    'retrieval': [
        'ClimateFEVER',
        'DBPedia',
        'FEVER',
        'QuoraRetrieval',
        'SciFact',
        'TRECCOVID',
    ],
}

model = my_model("dunzhang/stella_en_1.5B_v5")

tasks = mteb.get_tasks(
    tasks=tasks_name['classification'],
    languages=['eng'],
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="results",
)

Also, another question: under the MTEB implementation, is the MRL (Matryoshka) representation of this model (https://huggingface.co/dunzhang/stella_en_1.5B_v5) also implemented? If so, how can I use it? If not, how can I reproduce the MTEB results with SentenceTransformer (since the MRL representation is implemented in SentenceTransformer)?
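For reference, sentence-transformers exposes Matryoshka-style output via the `truncate_dim` constructor argument (e.g. `SentenceTransformer(model_name, truncate_dim=1024)`), which keeps a prefix of the embedding and re-normalizes it at encode time. Below is a minimal numpy sketch of that truncation step and of why scores computed at different dims can differ; it is illustrative only — the stella model's 1024-dim output may instead come from a learned projection head, which I have not verified, and `truncate_embeddings` here is a hypothetical helper, not an MTEB API:

```python
import numpy as np

def truncate_embeddings(emb: np.ndarray, dim: int, normalize: bool = True) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dim` components of each
    row, then re-normalize so cosine similarity stays well-defined."""
    out = emb[:, :dim]
    if normalize:
        out = out / np.linalg.norm(out, axis=1, keepdims=True)
    return out

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1536))          # stand-in for two 1536-dim embeddings
small = truncate_embeddings(full, 1024)    # stand-in for the 1024-dim variant

print(small.shape)  # (2, 1024)
cos_full = float(full[0] @ full[1] / (np.linalg.norm(full[0]) * np.linalg.norm(full[1])))
cos_small = float(small[0] @ small[1])     # rows are unit-norm, so dot = cosine
print(cos_full, cos_small)                 # the two scores generally differ
```

If the two load paths disagree on dimension, comparing the model's declared output dim (`SentenceTransformer.get_sentence_embedding_dimension()`) against the shape actually returned by `mteb.get_model(...).encode(...)` should show which path drops the projection/truncation step.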
