Detection of non-zero-shot models from the annotations we have #2158
Replies: 3 comments · 2 replies
-
I HAVE NOT RIGOROUSLY CROSS-VALIDATED MY RESULTS, THIS MIGHT BE INCORRECT
-
I think this is related to issue #1636
-
Should we examine discrepancies on a held-out group of models to see what the classifier misses, what it gets wrong, and what it gets right? (Just glancing over these, there seem to be a lot of false positives; we might adjust the threshold to address this.) Will you also share the script? E.g., just drop it in the scripts folder in a branch.
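For concreteness, here is a minimal sketch of the kind of held-out check and threshold adjustment suggested above, assuming the classifier setup described in the post below. The names (`annotated`, the `model`/`score_z`/`n_parameters`/`is_zero_shot` columns, the hyperparameters, and the 0.8 threshold) are hypothetical placeholders, not part of any existing script.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# `annotated` is assumed to be a per-(model, task) DataFrame with z-scored scores,
# n_parameters, and a trusted is_zero_shot annotation.
rng = np.random.default_rng(42)
models = annotated["model"].unique()
holdout_models = rng.choice(models, size=max(1, len(models) // 5), replace=False)

# Hold out an entire group of models so their annotations never enter training.
train_df = annotated[~annotated["model"].isin(holdout_models)]
holdout_df = annotated[annotated["model"].isin(holdout_models)]

features = ["score_z", "n_parameters"]
clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(train_df[features], train_df["is_zero_shot"].astype(bool))

# Column 0 of predict_proba corresponds to the False class (not zero-shot),
# i.e. the probability that the pair was trained on the task.
p_trained = clf.predict_proba(holdout_df[features])[:, 0]

threshold = 0.8  # stricter than the default 0.5, trading recall for fewer false positives
pred_trained = p_trained >= threshold
true_trained = ~holdout_df["is_zero_shot"].astype(bool)

# Shows what the classifier misses, what it gets wrong, and what it gets right.
print(classification_report(true_trained, pred_trained))
```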
-
Since we have quite a few annotations of which datasets models were and weren't trained on, I thought it might make sense to look for patterns and try to identify models that were likely trained on benchmark datasets.
I have trained a random forest classifier on z-score-normalized scores and n_parameters to predict whether a model is zero-shot, and then made predictions for all models lacking annotations.
I now have a list of (model, task) pairs where one could suspect that the model has been trained on the benchmark task.
The results should definitely be interpreted with a pinch of salt.
While some of them seem unlikely (I highly doubt that gte-Qwen1.5-7B-instruct has been trained on DKHateClassification, for instance), a lot of these seem very reasonable and confirm my intuition about what some of these models might have been trained on.
It is also reassuring that, for instance, for Linq embedding, where we know our annotations were incorrect, some of the tasks are marked as having been trained on.
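For reference, here is a minimal sketch of the setup described above (not the actual script). The input file name, the column names (`model`, `task`, `score`, `n_parameters`, `is_zero_shot`), and the hyperparameters are all assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical export of per-(model, task) scores plus existing zero-shot annotations.
df_results = pd.read_csv("mteb_results_with_annotations.csv")

def zscore_per_task(df: pd.DataFrame) -> pd.DataFrame:
    """z-score normalize raw scores within each task so models are comparable across tasks."""
    df = df.copy()
    df["score_z"] = df.groupby("task")["score"].transform(
        lambda s: (s - s.mean()) / s.std(ddof=0)
    )
    return df

df = zscore_per_task(df_results)
annotated = df[df["is_zero_shot"].notna()]  # (model, task) pairs with an annotation
unlabeled = df[df["is_zero_shot"].isna()]   # pairs we want predictions for

features = ["score_z", "n_parameters"]
clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(annotated[features], annotated["is_zero_shot"].astype(bool))

# Column order of predict_proba follows clf.classes_ ([False, True] for boolean labels),
# so column 0 is the probability that the pair is NOT zero-shot, i.e. trained on the task.
p_trained = clf.predict_proba(unlabeled[features])[:, 0]
suspects = unlabeled.assign(p_trained=p_trained).sort_values("p_trained", ascending=False)
print(suspects[["model", "task", "p_trained"]].head(20))
```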
cc. @KennethEnevoldsen @isaac-chung @Samoed @tomaarsen @Muennighoff