diff --git a/FlagEmbedding/LLARA/README.md b/FlagEmbedding/LLARA/README.md
index 7ebf189d..53e112cc 100644
--- a/FlagEmbedding/LLARA/README.md
+++ b/FlagEmbedding/LLARA/README.md
@@ -15,7 +15,7 @@ It is known for the following features:
 ## Environment
 ```bash
-conda create llara python=3.10
+conda create -n llara python=3.10
 
 conda activate llara
@@ -23,6 +23,8 @@ conda activate llara
 conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
 pip install transformers==4.41.0 deepspeed accelerate datasets peft pandas
 pip install flash-attn --no-build-isolation
+pip install protobuf==4.25.1
+pip install sentencepiece==0.1.99
 ```
 
 ## Model List
diff --git a/Tutorials/3_Indexing/indexing.ipynb b/Tutorials/3_Indexing/3.1.1_Intro_to_Faiss.ipynb
similarity index 99%
rename from Tutorials/3_Indexing/indexing.ipynb
rename to Tutorials/3_Indexing/3.1.1_Intro_to_Faiss.ipynb
index e2e2e123..46a157d2 100644
--- a/Tutorials/3_Indexing/indexing.ipynb
+++ b/Tutorials/3_Indexing/3.1.1_Intro_to_Faiss.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Indexing"
+    "# Indexing Using Faiss"
    ]
   },
   {
diff --git a/Tutorials/3_Indexing/3.1.2_Faiss_GPU.ipynb b/Tutorials/3_Indexing/3.1.2_Faiss_GPU.ipynb
new file mode 100644
index 00000000..b75cb5ed
--- /dev/null
+++ b/Tutorials/3_Indexing/3.1.2_Faiss_GPU.ipynb
@@ -0,0 +1,373 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Faiss GPU"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the last tutorial, we went through the basics of indexing using faiss-cpu. In real research and industry use cases, however, the dataset to index can be extremely large and the search frequency can be very high. In this tutorial we'll see how to combine Faiss and GPUs almost seamlessly."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Installation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Faiss maintains its latest releases on conda, and the GPU version only supports Linux x86_64.\n",
+    "\n",
+    "Create a conda virtual environment and run:\n",
+    "\n",
+    "```conda install -c pytorch -c nvidia faiss-gpu=1.8.0```\n",
+    "\n",
+    "Make sure you select that conda environment as the kernel for this notebook. After installation, restart the kernel."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If your system does not satisfy the requirements, install faiss-cpu instead and simply skip the GPU-related steps."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Data Preparation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First let's create a corpus of \"fake embeddings\". We will also reuse these vectors as the queries when searching:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import faiss\n",
+    "import numpy as np\n",
+    "\n",
+    "dim = 768\n",
+    "corpus_size = 1000\n",
+    "# np.random.seed(111)\n",
+    "\n",
+    "corpus = np.random.random((corpus_size, dim)).astype('float32')"
+   ]
+  },
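+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "(Optional) Before building any index, it can be handy to confirm that this Faiss build can actually see your GPUs, and to normalize the vectors if you intend to treat inner product as cosine similarity. This is only a sketch of common checks and is not required by the rest of the tutorial; `faiss.get_num_gpus()` and `faiss.normalize_L2()` are standard Faiss utilities:\n",
+    "\n",
+    "```python\n",
+    "print(faiss.get_num_gpus())   # number of GPUs visible to Faiss\n",
+    "# faiss.normalize_L2(corpus)  # in-place L2 normalization, only if you want cosine similarity with IndexFlatIP\n",
+    "```"
+   ]
+  },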
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Create Index on CPU"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Option 1:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Faiss provides a wide range of index types that can be initialized directly:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# first build a flat index (on CPU)\n",
+    "index = faiss.IndexFlatIP(dim)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Option 2:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Besides the basic index classes, we can also use the index_factory function to produce composite Faiss indexes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "index = faiss.index_factory(dim, \"Flat\", faiss.METRIC_L2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Build GPU Index and Search"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "All GPU indexes are built with a `StandardGpuResources` object. It contains all the resources needed for each GPU in use. By default it will allocate 18% of the total VRAM as a temporary scratch space."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `GpuClonerOptions` and `GpuMultipleClonerOptions` objects are optional when cloning an index from CPU to GPU. They are used to adjust the way the GPUs store the objects."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Single GPU:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# use a single GPU\n",
+    "rs = faiss.StandardGpuResources()\n",
+    "co = faiss.GpuClonerOptions()\n",
+    "\n",
+    "# then clone it to a GPU index\n",
+    "index_gpu = faiss.index_cpu_to_gpu(provider=rs, device=0, index=index, options=co)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 5.31 ms, sys: 6.26 ms, total: 11.6 ms\n",
+      "Wall time: 8.94 ms\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "index_gpu.add(corpus)\n",
+    "D, I = index_gpu.search(corpus, 4)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### All Available GPUs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If your system contains multiple GPUs, Faiss provides the option to deploy all available GPUs. You can control their usage through `GpuMultipleClonerOptions`, e.g. whether to shard or replicate the index across GPUs."
+   ]
+  },
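+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For example, the sketch below (not executed in this notebook) shards a large index across the GPUs instead of replicating it, and stores the vectors in float16 to save GPU memory. Both `shard` and `useFloat16` are standard attributes of Faiss's cloner options:\n",
+    "\n",
+    "```python\n",
+    "co = faiss.GpuMultipleClonerOptions()\n",
+    "co.shard = True        # split the index across GPUs instead of copying it to each GPU\n",
+    "co.useFloat16 = True   # store vectors in float16 to reduce GPU memory usage\n",
+    "```"
+   ]
+  },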
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# cloner options for multiple GPUs\n",
+    "co = faiss.GpuMultipleClonerOptions()\n",
+    "\n",
+    "index_gpu = faiss.index_cpu_to_all_gpus(index=index, co=co)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 29.8 ms, sys: 26.8 ms, total: 56.6 ms\n",
+      "Wall time: 33.9 ms\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "index_gpu.add(corpus)\n",
+    "D, I = index_gpu.search(corpus, 4)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Multiple GPUs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There is also the option to use multiple, but not all, GPUs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ngpu = 4\n",
+    "resources = [faiss.StandardGpuResources() for _ in range(ngpu)]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create vectors of the GpuResources objects and device ids, then pass them to the index_cpu_to_gpu_multiple() function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vres = faiss.GpuResourcesVector()\n",
+    "vdev = faiss.Int32Vector()\n",
+    "for i, res in zip(range(ngpu), resources):\n",
+    "    vdev.push_back(i)\n",
+    "    vres.push_back(res)\n",
+    "index_gpu = faiss.index_cpu_to_gpu_multiple(vres, vdev, index)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 3.49 ms, sys: 13.4 ms, total: 16.9 ms\n",
+      "Wall time: 9.03 ms\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "index_gpu.add(corpus)\n",
+    "D, I = index_gpu.search(corpus, 4)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "All three approaches should lead to identical results. Now let's do a quick sanity check:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The nearest neighbor of each vector in the corpus is itself\n",
+    "assert np.all(corpus[:] == corpus[I[:, 0]])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "And the corresponding distances should be 0."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[[ 0. 111.30057 113.2251 113.342316]\n",
+      " [ 0. 111.158875 111.742325 112.09038 ]\n",
+      " [ 0. 116.44429 116.849915 117.30502 ]]\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(D[:3])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "faiss",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/Tutorials/4_Evaluation/4.2.2_MTEB_Leaderboard.ipynb b/Tutorials/4_Evaluation/4.2.2_MTEB_Leaderboard.ipynb
new file mode 100644
index 00000000..b60c1984
--- /dev/null
+++ b/Tutorials/4_Evaluation/4.2.2_MTEB_Leaderboard.ipynb
@@ -0,0 +1,302 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# MTEB Leaderboard"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the last tutorial we showed how to evaluate an embedding model on a dataset supported by MTEB. In this tutorial, we will go through how to do a full evaluation and compare the results with the MTEB English leaderboard.\n",
+    "\n",
+    "Caution: evaluating on the full English MTEB is very time consuming, even with a GPU. So we encourage you to go through the notebook first to get an idea of the pipeline, and run the experiment when you have enough computing resources and time."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 0. Installation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Install the packages we will use in your environment:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "%pip install sentence_transformers mteb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Run the Evaluation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The MTEB English leaderboard contains 56 datasets on 7 tasks:\n",
+    "1. **Classification**: The embeddings are used to train a logistic regression classifier on the train set, which is then scored on the test set. F1 is the main metric.\n",
+    "2. **Clustering**: A mini-batch k-means model is trained with batch size 32 and k equal to the number of different labels, then scored using v-measure.\n",
+    "3. **Pair Classification**: A pair of text inputs is provided and a binary label needs to be assigned. The main metric is the average precision score.\n",
+    "4. **Reranking**: Rank a list of relevant and irrelevant reference texts according to a query. Metrics are mean MRR@k and MAP.\n",
+    "5. **Retrieval**: Each dataset comprises a corpus, queries, and a mapping that links each query to its relevant documents within the corpus. The goal is to retrieve relevant documents for each query. The main metric is nDCG@k. MTEB directly adopts BEIR for the retrieval task.\n",
+    "6. **Semantic Textual Similarity (STS)**: Determine the similarity between each sentence pair. Spearman correlation based on cosine similarity serves as the main metric.\n",
+    "7. **Summarization**: Only one dataset is used in this task. Machine-generated summaries are scored against human-written summaries by computing the distances between their embeddings. The main metric is also Spearman correlation based on cosine similarity.\n",
+    "\n",
+    "The benchmark is widely accepted by researchers and engineers to fairly evaluate and compare the performance of the models they train. Now let's take a look at the whole evaluation pipeline."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Import `MTEB_MAIN_EN` to check all 56 datasets:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['AmazonCounterfactualClassification', 'AmazonPolarityClassification', 'AmazonReviewsClassification', 'ArguAna', 'ArxivClusteringP2P', 'ArxivClusteringS2S', 'AskUbuntuDupQuestions', 'BIOSSES', 'Banking77Classification', 'BiorxivClusteringP2P', 'BiorxivClusteringS2S', 'CQADupstackAndroidRetrieval', 'CQADupstackEnglishRetrieval', 'CQADupstackGamingRetrieval', 'CQADupstackGisRetrieval', 'CQADupstackMathematicaRetrieval', 'CQADupstackPhysicsRetrieval', 'CQADupstackProgrammersRetrieval', 'CQADupstackStatsRetrieval', 'CQADupstackTexRetrieval', 'CQADupstackUnixRetrieval', 'CQADupstackWebmastersRetrieval', 'CQADupstackWordpressRetrieval', 'ClimateFEVER', 'DBPedia', 'EmotionClassification', 'FEVER', 'FiQA2018', 'HotpotQA', 'ImdbClassification', 'MSMARCO', 'MTOPDomainClassification', 'MTOPIntentClassification', 'MassiveIntentClassification', 'MassiveScenarioClassification', 'MedrxivClusteringP2P', 'MedrxivClusteringS2S', 'MindSmallReranking', 'NFCorpus', 'NQ', 'QuoraRetrieval', 'RedditClustering', 'RedditClusteringP2P', 'SCIDOCS', 'SICK-R', 'STS12', 'STS13', 'STS14', 'STS15', 'STS16', 'STS17', 'STS22', 'STSBenchmark', 'SciDocsRR', 'SciFact', 'SprintDuplicateQuestions', 'StackExchangeClustering', 'StackExchangeClusteringP2P', 'StackOverflowDupQuestions', 'SummEval', 'TRECCOVID', 'Touche2020', 'ToxicConversationsClassification', 'TweetSentimentExtractionClassification', 'TwentyNewsgroupsClustering', 'TwitterSemEval2015', 'TwitterURLCorpus']\n"
+     ]
+    }
+   ],
+   "source": [
+    "import mteb\n",
+    "from mteb.benchmarks import MTEB_MAIN_EN\n",
+    "\n",
+    "print(MTEB_MAIN_EN.tasks)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load the model we want to evaluate:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sentence_transformers import SentenceTransformer\n",
+    "\n",
+    "model_name = \"BAAI/bge-base-en-v1.5\"\n",
+    "model = SentenceTransformer(model_name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Alternatively, MTEB provides implementations of the popular models on its leaderboard so that you can reproduce their results:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model_name = \"BAAI/bge-base-en-v1.5\"\n",
+    "model = mteb.get_model(model_name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then start evaluating on each dataset:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for task in MTEB_MAIN_EN.tasks:\n",
+    "    # get the test set to evaluate on\n",
+    "    eval_splits = [\"dev\"] if task == \"MSMARCO\" else [\"test\"]\n",
+    "    evaluation = mteb.MTEB(\n",
+    "        tasks=[task], task_langs=[\"en\"]\n",
+    "    )  # Remove \"en\" to run all available languages\n",
+    "    evaluation.run(\n",
+    "        model, output_folder=\"results\", eval_splits=eval_splits\n",
+    "    )"
+   ]
+  },
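+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Because the full loop can take many hours, one optional sanity check (a sketch we suggest here, not part of the official MTEB pipeline) is to first run a single small task and confirm that results land in the output folder before launching the full run:\n",
+    "\n",
+    "```python\n",
+    "evaluation = mteb.MTEB(tasks=[\"Banking77Classification\"], task_langs=[\"en\"])\n",
+    "evaluation.run(model, output_folder=\"results\", eval_splits=[\"test\"])\n",
+    "```"
+   ]
+  },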
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Submit to MTEB Leaderboard"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After the evaluation is done, all the evaluation results should be stored in `results/{model_name}/{model_revision}`.\n",
+    "\n",
+    "Then run the following shell command to create model_card.md. Replace {model_name} and {model_revision} with your own paths."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If a README for that model already exists:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# !mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md --from_existing your_existing_readme.md "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Copy and paste the contents of model_card.md to the top of the README.md of your model on the HF Hub. Now relax and wait for the daily refresh of the leaderboard. Your model will show up soon!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Partially Evaluate"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that you don't need to finish all the tasks to get onto the leaderboard.\n",
+    "\n",
+    "For example, suppose you fine-tune a model for clustering and only care about how it performs with respect to clustering, not the other tasks. Then you can test it on just the clustering tasks of MTEB and submit the results to the leaderboard."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "TASK_LIST_CLUSTERING = [\n",
+    "    \"ArxivClusteringP2P\",\n",
+    "    \"ArxivClusteringS2S\",\n",
+    "    \"BiorxivClusteringP2P\",\n",
+    "    \"BiorxivClusteringS2S\",\n",
+    "    \"MedrxivClusteringP2P\",\n",
+    "    \"MedrxivClusteringS2S\",\n",
+    "    \"RedditClustering\",\n",
+    "    \"RedditClusteringP2P\",\n",
+    "    \"StackExchangeClustering\",\n",
+    "    \"StackExchangeClusteringP2P\",\n",
+    "    \"TwentyNewsgroupsClustering\",\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the evaluation with only the clustering tasks:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "evaluation = mteb.MTEB(tasks=TASK_LIST_CLUSTERING)\n",
+    "\n",
+    "results = evaluation.run(model, output_folder=\"results\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then repeat Step 2 to submit your model. After the leaderboard refreshes, you can find your model in the \"Clustering\" section of the leaderboard."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Future Work"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "MTEB is working on a new version of the English benchmark. It will contain updated and more concise tasks and make the evaluation process faster.\n",
+    "\n",
+    "Please check out their [GitHub](https://github.com/embeddings-benchmark/mteb) page for future updates and releases."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "base",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/Tutorials/README.md b/Tutorials/README.md
index 581f27dc..057a53b0 100644
--- a/Tutorials/README.md
+++ b/Tutorials/README.md
@@ -1,6 +1,8 @@
-# FlagEmbedding_tutorial
+# Tutorial
 
-If you are new to here, check out the 5 minute [quick start](./quick_start.ipynb)!
+FlagEmbedding offers a whole curriculum on retrieval, embedding models, RAG, and more. This section is being actively updated. Whether you are new to NLP or a veteran, we hope you can find something helpful!
+
+If you are new to embedding and retrieval, check out the 5 minute [quick start](./quick_start.ipynb)!
Tutorial roadmap

@@ -11,18 +13,43 @@
 This module includes tutorials and demos showing how to use BGE and Sentence Transformers, as well as other embedding related topics.
 
+- [x] Intro to embedding model
+- [x] BGE series
+- [x] Usage of BGE
+- [x] BGE-M3
+- [ ] BGE-ICL
+- ...
+
 ## [Similarity](./2_Similarity)
 
 In this part, we show popular similarity functions and techniques for searching.
 
+- [x] Similarity metrics
+- ...
+
 ## [Indexing](./3_Indexing)
 
 Although not included in the quick start, indexing is a very important part of practical use cases. This module shows how to use popular libraries like Faiss and Milvus to do indexing.
 
+- [x] Intro to Faiss
+- [x] Using GPU in Faiss
+- [ ] Index and Quantizer
+- [ ] Milvus
+- ...
+
 ## [Evaluation](./4_Evaluation)
 
 In this module, we'll show the full pipeline of evaluating an embedding model, as well as popular benchmarks like MTEB and C-MTEB.
 
+- [x] Evaluate MSMARCO
+- [x] Intro to MTEB
+- [x] MTEB Leaderboard Eval
+- [ ] C-MTEB
+- ...
+
 ## [Reranking](./5_Reranking/)
 
 To balance the accuracy-efficiency trade-off, many retrieval systems use an efficient retriever to quickly narrow down the candidates, and then use more accurate models to rerank them for the final results.
+
+- [x] Intro to reranker
+- ...
\ No newline at end of file