Commit 30faa0f

Merge pull request #144 from syedriko/syedriko-lcore-1538
LCORE-1538: Add embedding model metadata to the rag-content repo
2 parents 3c1856f + 67ceac2 commit 30faa0f

18 files changed (+32609, -2 lines)

.coderabbit.yaml

Lines changed: 2 additions & 1 deletion
@@ -27,7 +27,8 @@ reviews:
   auto_assign_reviewers: false
   in_progress_fortune: true
   poem: false
-  path_filters: []
+  path_filters:
+    - "!**/embeddings_model/**"
   path_instructions: []
   abort_on_close: true
   disable_cache: false

.github/workflows/pydocstyle.yaml

Lines changed: 1 addition & 1 deletion
@@ -17,4 +17,4 @@ jobs:
         with:
           python-version: '3.12'
       - name: Python linter
-        run: uv tool run pydocstyle -v .
+        run: uv tool run pydocstyle -v src scripts tests

Containerfile

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ COPY Makefile pyproject.toml uv.lock README.md Gemfile Gemfile.lock requirements
 COPY src ./src
 COPY tests ./tests
 COPY scripts ./scripts
+COPY embeddings_model ./embeddings_model
 COPY LICENSE /licenses/LICENSE
 
 # Install Ruby Gems

embeddings_model/.gitattributes

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zstandard filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+model.safetensors filter=lfs diff=lfs merge=lfs -text
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false
+}

embeddings_model/README.md

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
+---
+language: en
+license: apache-2.0
+library_name: sentence-transformers
+tags:
+- sentence-transformers
+- feature-extraction
+- sentence-similarity
+- transformers
+datasets:
+- s2orc
+- flax-sentence-embeddings/stackexchange_xml
+- ms_marco
+- gooaq
+- yahoo_answers_topics
+- code_search_net
+- search_qa
+- eli5
+- snli
+- multi_nli
+- wikihow
+- natural_questions
+- trivia_qa
+- embedding-data/sentence-compression
+- embedding-data/flickr30k-captions
+- embedding-data/altlex
+- embedding-data/simple-wiki
+- embedding-data/QQP
+- embedding-data/SPECTER
+- embedding-data/PAQ_pairs
+- embedding-data/WikiAnswers
+pipeline_tag: sentence-similarity
+---
+
+
+# all-mpnet-base-v2
+This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
+
+## Usage (Sentence-Transformers)
+Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+
+```
+pip install -U sentence-transformers
+```
+
+Then you can use the model like this:
+```python
+from sentence_transformers import SentenceTransformer
+sentences = ["This is an example sentence", "Each sentence is converted"]
+
+model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
+embeddings = model.encode(sentences)
+print(embeddings)
+```
+
+## Usage (HuggingFace Transformers)
+Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling operation on top of the contextualized word embeddings.
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+import torch.nn.functional as F
+
+# Mean Pooling - Take attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+# Sentences we want sentence embeddings for
+sentences = ['This is an example sentence', 'Each sentence is converted']
+
+# Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
+model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')
+
+# Tokenize sentences
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+# Compute token embeddings
+with torch.no_grad():
+    model_output = model(**encoded_input)
+
+# Perform pooling
+sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+# Normalize embeddings
+sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
+
+print("Sentence embeddings:")
+print(sentence_embeddings)
+```
+
+## Evaluation Results
+
+For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/all-mpnet-base-v2)
+
+------
+
+## Background
+
+The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
+contrastive learning objective. We used the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model and fine-tuned it on a
+1B sentence pairs dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in our dataset.
+
+We developed this model during the
+[Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
+organized by Hugging Face. We developed this model as part of the project:
+[Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as input from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.
+
+## Intended uses
+
+Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures
+the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
+
+By default, input text longer than 384 word pieces is truncated.
+
+
+## Training procedure
+
+### Pre-training
+
+We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to the model card for more detailed information about the pre-training procedure.
+
+### Fine-tuning
+
+We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch.
+We then apply the cross-entropy loss by comparing against the true pairs.
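Reviewer note: the fine-tuning objective described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the training script: the function names are hypothetical, and the `scale` factor of 20 applied to the cosine similarities is an assumption that the README does not state. Each anchor's true partner sits at the same index in `positives`; every other entry in the batch acts as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over the rows of the scaled similarity matrix.

    Row i holds the similarities between anchors[i] and every positive;
    the correct "class" for row i is index i. `scale` is an assumed value.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: aligned pairs should score a far lower loss than swapped pairs.
batch = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = in_batch_contrastive_loss(batch, batch)
swapped_loss = in_batch_contrastive_loss(batch, swapped)
```

A larger batch supplies more in-batch negatives per anchor, which is one reason contrastive setups tend to favour large batch sizes.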
+
+#### Hyper parameters
+
+We trained our model on a TPU v3-8. We train the model for 100k steps using a batch size of 1024 (128 per TPU core).
+We use a learning rate warm up of 500. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
+a 2e-5 learning rate. The full training script is accessible in this current repository: `train_script.py`.
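Reviewer note: a quick arithmetic check of the figures above, assuming the 8 cores of a TPU v3-8 and one training tuple per batch element (the second point is an assumption, since some datasets are listed as triplets). Variable names are illustrative only.

```python
# Figures quoted in the "Hyper parameters" paragraph.
per_core_batch = 128
tpu_cores = 8            # a TPU v3-8 exposes 8 cores
training_steps = 100_000

# Total taken from the last row of the training-data table in this README.
total_training_tuples = 1_170_060_424

global_batch = per_core_batch * tpu_cores
tuples_seen = global_batch * training_steps
fraction_of_corpus = tuples_seen / total_training_tuples

print(f"global batch: {global_batch}")
print(f"tuples seen in training: {tuples_seen:,}")
print(f"fraction of listed tuples: {fraction_of_corpus:.3f}")
```

Under these assumptions the run draws fewer tuples than the table lists in total, which fits the weighted sampling that the README describes for the training data.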
+
+#### Training data
+
+We use the concatenation of multiple datasets to fine-tune our model. The total number of sentence pairs is above 1 billion.
+We sampled each dataset with a weighted probability whose configuration is detailed in the `data_config.json` file.
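Reviewer note: weighted dataset sampling of the kind described above could look roughly like the sketch below. The weights and dataset keys are invented for illustration; the real values live in `data_config.json`, which is not part of this diff, and the helper name is hypothetical.

```python
import random

# Hypothetical weights for illustration only, not the values in data_config.json.
example_weights = {"reddit": 5.0, "s2orc": 3.0, "wikianswers": 2.0}


def sample_dataset_names(weights, k, seed=0):
    """Draw k dataset names, each with probability proportional to its weight."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


# Each pick decides which dataset the next training tuple is read from.
picks = sample_dataset_names(example_weights, k=1000)
counts = {name: picks.count(name) for name in example_weights}
print(counts)
```

Sampling by weight rather than by raw size lets smaller datasets contribute more often than their tuple counts alone would allow.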
+
+
+| Dataset | Paper | Number of training tuples |
+|--------------------------------------------------------|:----------------------------------------:|:--------------------------:|
+| [Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit) | [paper](https://arxiv.org/abs/1904.06472) | 726,484,430 |
+| [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Abstracts) | [paper](https://aclanthology.org/2020.acl-main.447/) | 116,288,806 |
+| [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs | [paper](https://doi.org/10.1145/2623330.2623677) | 77,427,422 |
+| [PAQ](https://github.com/facebookresearch/PAQ) (Question, Answer) pairs | [paper](https://arxiv.org/abs/2102.07033) | 64,371,441 |
+| [S2ORC](https://github.com/allenai/s2orc) Citation pairs (Titles) | [paper](https://aclanthology.org/2020.acl-main.447/) | 52,603,982 |
+| [S2ORC](https://github.com/allenai/s2orc) (Title, Abstract) | [paper](https://aclanthology.org/2020.acl-main.447/) | 41,769,185 |
+| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs | - | 25,316,456 |
+| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs | - | 21,396,559 |
+| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs | - | 21,396,559 |
+| [MS MARCO](https://microsoft.github.io/msmarco/) triplets | [paper](https://doi.org/10.1145/3404835.3462804) | 9,144,553 |
+| [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) | [paper](https://arxiv.org/pdf/2104.08727.pdf) | 3,012,496 |
+| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 1,198,260 |
+| [Code Search](https://huggingface.co/datasets/code_search_net) | - | 1,151,414 |
+| [COCO](https://cocodataset.org/#home) Image captions | [paper](https://link.springer.com/chapter/10.1007%2F978-3-319-10602-1_48) | 828,395 |
+| [SPECTER](https://github.com/allenai/specter) citation triplets | [paper](https://doi.org/10.18653/v1/2020.acl-main.207) | 684,100 |
+| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 681,164 |
+| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) | [paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html) | 659,896 |
+| [SearchQA](https://huggingface.co/datasets/search_qa) | [paper](https://arxiv.org/abs/1704.05179) | 582,261 |
+| [Eli5](https://huggingface.co/datasets/eli5) | [paper](https://doi.org/10.18653/v1/p19-1346) | 325,475 |
+| [Flickr 30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/229/33) | 317,695 |
+| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles) | | 304,525 |
+| AllNLI ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)) | [paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101) | 277,230 |
+| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies) | | 250,519 |
+| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies) | | 250,460 |
+| [Sentence Compression](https://github.com/google-research-datasets/sentence-compression) | [paper](https://www.aclweb.org/anthology/D13-1155/) | 180,000 |
+| [Wikihow](https://github.com/pvl/wikihow_pairs_dataset) | [paper](https://arxiv.org/abs/1810.09305) | 128,542 |
+| [Altlex](https://github.com/chridey/altlex/) | [paper](https://aclanthology.org/P16-1135.pdf) | 112,696 |
+| [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) | - | 103,663 |
+| [Simple Wikipedia](https://cs.pomona.edu/~dkauchak/simplification/) | [paper](https://www.aclweb.org/anthology/P11-2117/) | 102,225 |
+| [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) | [paper](https://transacl.org/ojs/index.php/tacl/article/view/1455) | 100,231 |
+| [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) | [paper](https://aclanthology.org/P18-2124.pdf) | 87,599 |
+| [TriviaQA](https://huggingface.co/datasets/trivia_qa) | - | 73,346 |
+| **Total** | | **1,170,060,424** |

embeddings_model/config.json

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+{
+  "_name_or_path": "microsoft/mpnet-base",
+  "architectures": [
+    "MPNetForMaskedLM"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "mpnet",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "relative_attention_num_buckets": 32,
+  "transformers_version": "4.8.2",
+  "vocab_size": 30527
+}
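Reviewer note: the transformer config above and the pooling config earlier in this diff should agree on the embedding width. A minimal consistency check might look like the following; the helper name is hypothetical, and the JSON snippets are excerpts copied from the two hunks rather than loaded from disk.

```python
import json

# Excerpts copied from the two config hunks in this diff.
transformer_config = json.loads('{"hidden_size": 768, "num_attention_heads": 12}')
pooling_config = json.loads(
    '{"word_embedding_dimension": 768, "pooling_mode_mean_tokens": true}'
)


def check_dimensions(transformer_cfg, pooling_cfg):
    """Hypothetical helper: validate the two configs and return the per-head size."""
    hidden = transformer_cfg["hidden_size"]
    heads = transformer_cfg["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if hidden != pooling_cfg["word_embedding_dimension"]:
        raise ValueError("pooling layer width does not match transformer hidden_size")
    return hidden // heads


head_size = check_dimensions(transformer_config, pooling_config)
print(f"per-head dimension: {head_size}")
```

A check like this could run in CI so that a future model swap does not leave the two files out of sync.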
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+{
+  "__version__": {
+    "sentence_transformers": "2.0.0",
+    "transformers": "4.6.1",
+    "pytorch": "1.8.1"
+  }
+}
