
Conversation

@ayush1298
Collaborator

@ayush1298 ayush1298 commented Jan 3, 2026

closes #3258

Copilot AI review requested due to automatic review settings January 3, 2026 19:40
@ayush1298
Collaborator Author

ayush1298 commented Jan 3, 2026

Just added the CSV for now. Will remove the memory usage column, add active/embedding parameters to the ModelMeta of each model, and then add the column to the LB.

Contributor

Copilot AI left a comment


Pull request overview

This pull request adds a CSV file containing model parameter data for models in the MTEB leaderboard. The file documents total parameters, active parameters, and input embedding parameters for over 500 models. This addresses issues #3258 and #3259.

Key Changes

  • Addition of a CSV file documenting parameter counts (total, active, and input embedding) for 554 models
  • The CSV includes success/error status for each model, tracking whether parameter extraction succeeded or encountered issues


@ayush1298
Collaborator Author

ayush1298 commented Jan 4, 2026

@Samoed @KennethEnevoldsen @isaac-chung
One doubt I have: the value of n_parameters (total number of parameters) that we have in ModelMeta is almost equal to (though not exactly the same as) total_params + embedding_params counted using the code.

For model: Alibaba-NLP/gte-Qwen2-7B-instruct

  ModelMeta:  7,613,000,000
  In CSV: total_params = 7,069,121,024, active_params = 6,525,621,760, embedding_params = 543,499,264

So total_params + embedding_params = 7,612,620,288, which is almost equal to what we have in ModelMeta.

Also, in the CSV these three values came from the code below.

import numpy as np
from transformers import AutoModel

model_name = "google/embeddinggemma-300m"
model = AutoModel.from_pretrained(model_name)

# Parameters in the input embedding matrix (vocab_size x hidden_size).
input_params = np.prod(model.get_input_embeddings().weight.shape)
# Total parameter count across all weights of the loaded model.
total_params = np.sum([np.prod(x.shape) for x in model.parameters()])
# "Active" parameters = everything except the input embeddings.
active_params = total_params - input_params

So, what exactly should I add to ModelMeta? Should I add two new fields to ModelMeta named active_parameter and embedding_parameter? And should we show both of these on the LB and remove memory_usage and number of parameters?

@Samoed
Member

Samoed commented Jan 4, 2026

Let's modify leaderboard in separate PR

One doubt I have: the value of n_parameters (total number of parameters) that we have in ModelMeta is almost equal to (though not exactly the same as) total_params + embedding_params counted using the code.

Yes, it is expected that total parameters = active parameters + embedding parameters.

Should I add two new fields to ModelMeta named active_parameter and embedding_parameter

Probably yes

should we show both of these on the LB and remove memory_usage and number of parameters?

Let's discuss this in issue

@Samoed Samoed changed the title from "Add active parameter column on LB" to "Add active parameter to ModelMeta" Jan 4, 2026
@KennethEnevoldsen KennethEnevoldsen marked this pull request as draft January 4, 2026 11:48
@KennethEnevoldsen
Contributor

Converting this to a draft given the current state, but great to see it!

  • Should we have a script to compute active parameters (e.g. the automatic metadata)? Should we consider how we handle MoE? (This is where active != total - embedding.)
  • Leaderboard: Let us deal with that in a separate PR
  • embedding parameters: If we define active parameters as total - embedding (might be a good enough assumption) I would probably just add embedding parameters and let the other one be a property. We could potentially allow an overwrite using e.g. self._active_parameters
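
A minimal sketch of this property-with-override idea, assuming illustrative field names rather than the final ModelMeta API:

from dataclasses import dataclass


@dataclass
class ModelMetaSketch:
    n_parameters: int | None = None
    n_embedding_parameters: int | None = None
    # Explicit override, e.g. for MoE models where active != total - embedding.
    _n_active_parameters: int | None = None

    @property
    def active_parameters(self) -> int | None:
        # Prefer the explicit override when it is set.
        if self._n_active_parameters is not None:
            return self._n_active_parameters
        if self.n_parameters is None or self.n_embedding_parameters is None:
            return None
        # Default assumption: active = total - embedding.
        return self.n_parameters - self.n_embedding_parameters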

@ayush1298
Collaborator Author

  • Should we have a script to compute active parameters (e.g. the automatic metadata)? Should we consider how we handle MoE? (This is where active != total - embedding.)

Yes, I will add the logic to calculate this automatically in the from_hub method. I think MoE models are the case where I was not able to calculate it, hence I think at least in ModelMeta we can keep n_parameters and add both active and embedding. Also, I have already calculated these parameters for around 315 models, and around 60 of the 498 models we currently have are proprietary.

  • embedding parameters: If we define active parameters as total - embedding (might be a good enough assumption) I would probably just add embedding parameters and let the other one be a property. We could potentially allow an overwrite using e.g. self._active_parameters

For now, we can use the CSV. But I think it's better to keep both in ModelMeta, as then we can also handle the MoE case, where this property will not be useful.

@ayush1298
Collaborator Author

I have updated the files with this metadata and also added it to the .from_hub() method. I have also added the script that I used to update ModelMeta; I will remove this script at the end.

@ayush1298
Collaborator Author

ayush1298 commented Jan 4, 2026

Previously when I ran make lint, it fixed errors itself (before uv was added everywhere), but now it just reports errors without fixing them. Why is this happening?

@Samoed
Member

Samoed commented Jan 4, 2026

Strange. I don't have this behavior

@KennethEnevoldsen
Contributor

Hmm, do we want to use the CSV? I think we should rather use the ModelMeta object; if not, then I think we should use the JSON format, similar to what we do with descriptive statistics.

@Samoed
Member

Samoed commented Jan 4, 2026

I think the CSV is just for demonstration.

@ayush1298
Collaborator Author

ayush1298 commented Jan 5, 2026

Hmm, do we want to use the CSV? I think we should rather use the ModelMeta object; if not, then I think we should use the JSON format, similar to what we do with descriptive statistics.

It's just for demonstration of all the results I have calculated, and I filled in the ModelMeta of the models using that CSV.
In the future, if someone wants to calculate or add it for models where I was not able to, they can do it in the following way:

import mteb

model_meta = mteb.get_model_meta("model_name")
active_parameters, embedding_parameters = model_meta.extract_parameter_breakdown_from_hub

@ayush1298 ayush1298 marked this pull request as ready for review January 5, 2026 11:50
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


I would convert active parameters to a property and just add embedding parameters.

@ayush1298
Collaborator Author

Converted active_parameters to a property. I will check for which models n_parameters != embedding + active based on the results in the CSV, and only for those will I keep the _n_active_parameters field in ModelMeta; I will remove the field for all other models.

@ayush1298 ayush1298 force-pushed the add_active_parameter branch from 5eda71b to 3db3369 January 6, 2026 14:38
@ayush1298
Collaborator Author

@KennethEnevoldsen @Samoed How do I actually check for MoE? The problem here is that the n_parameters we already have will not exactly match embedding + active, since n_parameters was reported rather than calculated exactly by loading the model. So how should I decide which model is MoE? I tried keeping some threshold on the difference between the embedding + active value and the n_parameters value, but I am not sure how correct it is to identify MoE that way.
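
A rough sketch of that threshold heuristic; the CSV file name, its column names, and the 1% tolerance are assumptions, not part of this PR:

import pandas as pd

import mteb

df = pd.read_csv("model_parameters.csv")  # hypothetical file name

suspected_moe = []
for row in df.itertuples():
    meta = mteb.get_model_meta(row.model_name)  # assumes a "model_name" column
    if meta.n_parameters is None:
        continue
    computed = row.active_params + row.embedding_params
    # Flag models whose reported total differs from active + embedding by more
    # than 1%; these are MoE candidates (or have mis-reported totals).
    if abs(meta.n_parameters - computed) / meta.n_parameters > 0.01:
        suspected_moe.append(row.model_name)

print(suspected_moe)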

@Samoed
Member

Samoed commented Jan 7, 2026

the n_parameters we already have will not exactly match embedding + active

Yes, but you can set _n_active_parameters and it will be returned correctly.

So how should I decide which model is MoE? I tried keeping some threshold on the difference between the embedding + active value and the n_parameters value, but I am not sure how correct it is to identify MoE that way.

I'm not sure about automating it. I think model authors can submit their score by themselves. You can try to test this on nomic-ai/nomic-embed-text-v2-moe

@ayush1298
Collaborator Author

ayush1298 commented Jan 7, 2026

the n_parameters we already have will not exactly match embedding + active

Yes, but you can set _n_active_parameters and it will be returned correctly.

But Kenneth wants to have it only for MoE models in ModelMeta, and for the rest of the models it should not be present, so I was thinking about how to detect MoE. I am okay with keeping it for all models in ModelMeta, but then what's the point of keeping it as a property? Though it will be used for future models where results are not reported.

So how should I decide which model is MoE? I tried keeping some threshold on the difference between the embedding + active value and the n_parameters value, but I am not sure how correct it is to identify MoE that way.

I'm not sure about automating it. I think model authors can submit their score by themselves. You can try to test this on nomic-ai/nomic-embed-text-v2-moe

So I think the best way is to keep _n_active_parameters in all ModelMeta for now. In the future, if results are not reported, the property will be used.

@ayush1298
Collaborator Author

One more question: in ModelMeta it can't be named _n_active_parameters, right, since that would make it a private variable? We would have to name it active_parameters or n_active_parameters?

@Samoed
Member

Samoed commented Jan 7, 2026

But Kenneth wants to have it only for MoE models in ModelMeta

Yes, you shouldn't fill it for other models, I think. I don't see a problem here.

we have to name it as active_parameters or n_active_parameters?

I think you can rename _n_active_parameters to n_active_parameters and name the property active_parameters.

@ayush1298
Collaborator Author

But Kenneth wants to have it only for MoE models in ModelMeta

Yes, you shouldn't fill it for other models, I think. I don't see a problem here.

Yes, the problem is just identifying which models are MoE.

@ayush1298
Collaborator Author

ayush1298 commented Jan 7, 2026

I have added _calculate_embedding_parameters to from_cross_encoder and from_sentence_transformer. Whether we should also keep calculate_embedding_parameters exposed for direct access is yet to be decided (see the sketch after the list below).
I have also updated the model2vec models with n_embedding_parameters=n_parameters, since those models consist entirely of static embeddings.

As per my analysis, below is the list of model2vec models; let me know if any of these is not a model2vec model, or if there are other model2vec models I missed:

  1. NeuML/pubmedbert-base-embeddings-100K
  2. NeuML/pubmedbert-base-embeddings-1M
  3. NeuML/pubmedbert-base-embeddings-2M
  4. NeuML/pubmedbert-base-embeddings-500K
  5. NeuML/pubmedbert-base-embeddings-8M
  6. minishlab/M2V_base_glove
  7. minishlab/M2V_base_glove_subword
  8. minishlab/M2V_base_output
  9. minishlab/M2V_multilingual_output
  10. minishlab/potion-base-2M
  11. minishlab/potion-base-4M
  12. minishlab/potion-base-8M
  13. minishlab/potion-multilingual-128M
  14. rasgaard/m2v-dfm-large

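A sketch of the idea behind _calculate_embedding_parameters for sentence-transformer models; the exact signature and placement in the PR may differ:

from sentence_transformers import SentenceTransformer


def calculate_embedding_parameters(model: SentenceTransformer) -> int | None:
    """Return the number of input-embedding parameters, if discoverable."""
    first_module = model[0]
    # Dense transformer modules expose the underlying HF model as `auto_model`.
    if hasattr(first_module, "auto_model"):
        embedding_weight = first_module.auto_model.get_input_embeddings().weight
        return int(embedding_weight.numel())
    return None


if __name__ == "__main__":
    st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    print(calculate_embedding_parameters(st_model))
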
@ayush1298
Collaborator Author

I don't know why a typecheck error is appearing; I have added explicit conditions for that. I have also tested the code for both sentence-transformer and cross-encoder models. Also, there are some dependency conflicts.


meta = cls._from_hub(model.model.name_or_path, revision, compute_metadata)
try:
    if isinstance(model.model, PreTrainedModel):
Member


I don't think this check is necessary, because you'll get a valid cross-encoder model.

Collaborator Author


I thought typecheck was failing because it was not able to identify this one, so I added explicit checks.

meta = cls._from_hub(name, revision, compute_metadata)
try:
    first_module = model[0]
    if hasattr(first_module, "auto_model"):
Member


Why do you need these checks?



Development

Successfully merging this pull request may close these issues.

Add active/embedding parameters separately
