**`BGE-Code-v1 <https://huggingface.co/BAAI/bge-code-v1>`_** is an LLM-based code embedding model that supports code retrieval, text retrieval, and multilingual retrieval. It primarily demonstrates the following capabilities:
- Superior Code Retrieval Performance: The model demonstrates exceptional code retrieval capabilities, supporting natural language queries in both English and Chinese, as well as 20 programming languages.
- Robust Text Retrieval Capabilities: The model maintains strong text retrieval capabilities comparable to text embedding models of similar scale.
- Extensive Multilingual Support: BGE-Code-v1 offers comprehensive multilingual retrieval capabilities, excelling in languages such as English, Chinese, Japanese, French, and more.
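The capabilities above all reduce to the same retrieval step at inference time: embed the query and the candidates, then rank by similarity. The following is a toy sketch of that step with random vectors standing in for real BGE-Code-v1 outputs; every name in it is illustrative, not part of the model's API.

```python
import torch
import torch.nn.functional as F

# Toy retrieval sketch: rank candidate code snippets against a natural-language
# query by cosine similarity of unit-normalized embeddings. The random vectors
# below merely stand in for real BGE-Code-v1 outputs.
torch.manual_seed(0)
query_emb = F.normalize(torch.randn(1, 16), dim=-1)   # one query embedding
code_embs = F.normalize(torch.randn(3, 16), dim=-1)   # three candidate snippets

scores = query_emb @ code_embs.T                      # cosine similarities
best = scores.argmax(dim=-1)                          # index of the top candidate
print(scores.shape, best.shape)
```

Because both sides are L2-normalized, the dot product equals cosine similarity, so scores are bounded in [-1, 1] and directly comparable across candidates.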
| `BAAI/bge-code-v1 <https://huggingface.co/BAAI/bge-code-v1>`_ | Multilingual | 1.5B | 6.18 GB | SOTA code retrieval model, with exceptional multilingual text retrieval performance as well |
| `BAAI/bge-vl-MLLM-S2 <https://huggingface.co/BAAI/BGE-VL-MLLM-S2>`_ | English | 7.57B | 15.14 GB | BGE-VL-MLLM-S1 fine-tuned for one epoch on the MMEB training set |
| `BAAI/BGE-VL-v1.5-zs <https://huggingface.co/BAAI/BGE-VL-v1.5-zs>`_ | English | 7.57B | 15.14 GB | Improved multi-modal retrieval model that performs well across a wide range of tasks |
| `BAAI/BGE-VL-v1.5-mmeb <https://huggingface.co/BAAI/BGE-VL-v1.5-mmeb>`_ | English | 7.57B | 15.14 GB | Better multi-modal retrieval model, additionally fine-tuned on MMEB training set |
BGE-VL-CLIP
-----------
The normalized last hidden state of the [EOS] token in the MLLM is used as the embedding.
print(scores)
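The [EOS]-pooling described above can be sketched in a few lines: pick the hidden state at the last non-padding position of each sequence, then L2-normalize it. Shapes are assumed for illustration, and `eos_pool` is a hypothetical helper, not part of the released model's API.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (shapes assumed): pool an embedding from a decoder's hidden
# states by taking the last non-padding token ([EOS]) per sequence, then
# L2-normalizing it. `eos_pool` is an illustrative helper, not a model method.
def eos_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # last_hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    eos_idx = attention_mask.sum(dim=1) - 1            # [EOS] position per row
    batch_idx = torch.arange(last_hidden.size(0))
    emb = last_hidden[batch_idx, eos_idx]              # (batch, dim)
    return F.normalize(emb, p=2, dim=-1)

hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
emb = eos_pool(hidden, mask)
print(emb.shape)  # torch.Size([2, 8])
```

Indexing with the per-row [EOS] position (rather than always taking the last column) matters whenever sequences in a batch are right-padded to different lengths.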
BGE-VL-v1.5
-----------
The BGE-VL-v1.5 series is the updated version of BGE-VL, bringing better performance on both retrieval and multi-modal understanding. The models were trained on 30M MegaPairs samples plus an additional 10M natural and synthetic samples.
``bge-vl-v1.5-zs`` is a zero-shot model, trained only on the data mentioned above; ``bge-vl-v1.5-mmeb`` is additionally fine-tuned on the MMEB training set.
.. code-block:: python

    import torch
    from transformers import AutoModel

    MODEL_NAME = "BAAI/BGE-VL-v1.5-mmeb"  # or "BAAI/BGE-VL-v1.5-zs"

    model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model.eval()
    model.cuda()

    with torch.no_grad():
        model.set_processor(MODEL_NAME)

        query_inputs = model.data_process(
            text="Make the background dark, as if the camera has taken the photo at night",
            images="../../imgs/cir_query.png",
            q_or_c="q",
            task_instruction="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: "
        )