Commit 51c3956

Updated to All v2 Dataset
1 parent 6802dac commit 51c3956

4 files changed: 99 additions and 23 deletions

docs/training/all.md

Lines changed: 15 additions & 11 deletions
@@ -4,17 +4,21 @@ Inspired by [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-
 
 ## Training Data
 
-| Dataset | Task | Data Instance | Number of Training Tuples |
-| ------------------------------------------------------------------------------------ | :---------------------------: | :-------------------------------------------: | :-----------------------: |
-| [indonli](https://huggingface.co/datasets/indonli) | Natural Language Inference | `(premise, entailment, contradiction)` | 3,914 |
-| [indolem/indo_story_cloze](https://huggingface.co/datasets/indolem/indo_story_cloze) | Commonsense Reasoning | `(context, correct ending, incorrect ending)` | 1,000 |
-| [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco) | Passage Retrieval | `(query, positive passage, negative passage)` | 100,000 |
-| [miracl/miracl](https://huggingface.co/datasets/miracl/miracl) | Passage Retrieval | `(query, positive passage, negative passage)` | 8,086 |
-| [SEACrowd/wrete](https://huggingface.co/datasets/SEACrowd/wrete) | Textual Entailment | `(sentenceA, sentenceB)` | 183 |
-| [SEACrowd/indolem_ntp](https://huggingface.co/datasets/SEACrowd/indolem_ntp) | Textual Entailment | `(tweet, next tweet)` | 5,681 |
-| [khalidalt/tydiqa-goldp](https://huggingface.co/datasets/khalidalt/tydiqa-goldp) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 11,404 |
-| [SEACrowd/facqa](https://huggingface.co/datasets/SEACrowd/facqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 4,990 |
-| **Total** | | | **135,258** |
+| Dataset | Task | Data Instance | Number of Training Tuples |
+| ------------------------------------------------------------------------------------------------------------------ | :----------------------------: | :-------------------------------------------: | :-----------------------: |
+| [indonli](https://huggingface.co/datasets/indonli) | Natural Language Inference | `(premise, entailment, contradiction)` | 3,914 |
+| [indolem/indo_story_cloze](https://huggingface.co/datasets/indolem/indo_story_cloze) | Commonsense Reasoning | `(context, correct ending, incorrect ending)` | 1,000 |
+| [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco) | Passage Retrieval | `(query, positive passage, negative passage)` | 100,000 |
+| [miracl/miracl](https://huggingface.co/datasets/miracl/miracl) | Passage Retrieval | `(query, positive passage, negative passage)` | 8,086 |
+| [SEACrowd/wrete](https://huggingface.co/datasets/SEACrowd/wrete) | Textual Entailment | `(sentenceA, sentenceB)` | 183 |
+| [SEACrowd/indolem_ntp](https://huggingface.co/datasets/SEACrowd/indolem_ntp) | Textual Entailment | `(tweet, next tweet)` | 5,681 |
+| [khalidalt/tydiqa-goldp](https://huggingface.co/datasets/khalidalt/tydiqa-goldp) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 11,404 |
+| [SEACrowd/facqa](https://huggingface.co/datasets/SEACrowd/facqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 4,990 |
+| *included in v2* |
+| [indonesian-nlp/lfqa_id](https://huggingface.co/datasets/indonesian-nlp/lfqa_id) | Open-domain Question-Answering | `(question, answer)` | 226,147 |
+| [jakartaresearch/indoqa](https://huggingface.co/datasets/jakartaresearch/indoqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 6,498 |
+| [jakartaresearch/id-paraphrase-detection](https://huggingface.co/datasets/jakartaresearch/id-paraphrase-detection) | Paraphrase | `(sentence, rephrased sentence)` | 4,076 |
+| **Total** | | | **371,979** |
 
 ## All Supervised Datasets with MultipleNegativesRankingLoss
 
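As a quick sanity check on the updated table, the per-dataset counts can be summed to confirm both the old and the new totals. This is only a sketch: the counts are copied from the table in this hunk, and the names `v1_counts` and `v2_additions` are illustrative and do not appear in the repository.

```python
# Tuple counts copied from the "Training Data" table in this diff.
v1_counts = {
    "indonli": 3_914,
    "indolem/indo_story_cloze": 1_000,
    "unicamp-dl/mmarco": 100_000,
    "miracl/miracl": 8_086,
    "SEACrowd/wrete": 183,
    "SEACrowd/indolem_ntp": 5_681,
    "khalidalt/tydiqa-goldp": 11_404,
    "SEACrowd/facqa": 4_990,
}

# Datasets added by this commit (the rows under "included in v2").
v2_additions = {
    "indonesian-nlp/lfqa_id": 226_147,
    "jakartaresearch/indoqa": 6_498,
    "jakartaresearch/id-paraphrase-detection": 4_076,
}

v1_total = sum(v1_counts.values())
v2_total = v1_total + sum(v2_additions.values())

print(f"{v1_total:,}")  # 135,258, the total on the removed side of the diff
print(f"{v2_total:,}")  # 371,979, the total on the added side of the diff
```

Both sums agree with the **Total** rows before and after the change.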

training/all/README.md

Lines changed: 15 additions & 11 deletions
@@ -4,17 +4,21 @@ Inspired by [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-
 
 ## Training Data
 
-| Dataset | Task | Data Instance | Number of Training Tuples |
-| ------------------------------------------------------------------------------------ | :---------------------------: | :-------------------------------------------: | :-----------------------: |
-| [indonli](https://huggingface.co/datasets/indonli) | Natural Language Inference | `(premise, entailment, contradiction)` | 3,914 |
-| [indolem/indo_story_cloze](https://huggingface.co/datasets/indolem/indo_story_cloze) | Commonsense Reasoning | `(context, correct ending, incorrect ending)` | 1,000 |
-| [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco) | Passage Retrieval | `(query, positive passage, negative passage)` | 100,000 |
-| [miracl/miracl](https://huggingface.co/datasets/miracl/miracl) | Passage Retrieval | `(query, positive passage, negative passage)` | 8,086 |
-| [SEACrowd/wrete](https://huggingface.co/datasets/SEACrowd/wrete) | Textual Entailment | `(sentenceA, sentenceB)` | 183 |
-| [SEACrowd/indolem_ntp](https://huggingface.co/datasets/SEACrowd/indolem_ntp) | Textual Entailment | `(tweet, next tweet)` | 5,681 |
-| [khalidalt/tydiqa-goldp](https://huggingface.co/datasets/khalidalt/tydiqa-goldp) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 11,404 |
-| [SEACrowd/facqa](https://huggingface.co/datasets/SEACrowd/facqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 4,990 |
-| **Total** | | | **135,258** |
+| Dataset | Task | Data Instance | Number of Training Tuples |
+| ------------------------------------------------------------------------------------------------------------------ | :----------------------------: | :-------------------------------------------: | :-----------------------: |
+| [indonli](https://huggingface.co/datasets/indonli) | Natural Language Inference | `(premise, entailment, contradiction)` | 3,914 |
+| [indolem/indo_story_cloze](https://huggingface.co/datasets/indolem/indo_story_cloze) | Commonsense Reasoning | `(context, correct ending, incorrect ending)` | 1,000 |
+| [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco) | Passage Retrieval | `(query, positive passage, negative passage)` | 100,000 |
+| [miracl/miracl](https://huggingface.co/datasets/miracl/miracl) | Passage Retrieval | `(query, positive passage, negative passage)` | 8,086 |
+| [SEACrowd/wrete](https://huggingface.co/datasets/SEACrowd/wrete) | Textual Entailment | `(sentenceA, sentenceB)` | 183 |
+| [SEACrowd/indolem_ntp](https://huggingface.co/datasets/SEACrowd/indolem_ntp) | Textual Entailment | `(tweet, next tweet)` | 5,681 |
+| [khalidalt/tydiqa-goldp](https://huggingface.co/datasets/khalidalt/tydiqa-goldp) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 11,404 |
+| [SEACrowd/facqa](https://huggingface.co/datasets/SEACrowd/facqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 4,990 |
+| *included in v2* |
+| [indonesian-nlp/lfqa_id](https://huggingface.co/datasets/indonesian-nlp/lfqa_id) | Open-domain Question-Answering | `(question, answer)` | 226,147 |
+| [jakartaresearch/indoqa](https://huggingface.co/datasets/jakartaresearch/indoqa) | Extractive Question-Answering | `(question, passage)`, `(question, answer)` | 6,498 |
+| [jakartaresearch/id-paraphrase-detection](https://huggingface.co/datasets/jakartaresearch/id-paraphrase-detection) | Paraphrase | `(sentence, rephrased sentence)` | 4,076 |
+| **Total** | | | **371,979** |
 
 ## All Supervised Datasets with MultipleNegativesRankingLoss
 

training/all/all_datasets.py

Lines changed: 53 additions & 0 deletions
@@ -4,6 +4,7 @@
 
 from datasets import load_dataset
 from sentence_transformers import InputExample
+import numpy as np
 
 ##############
 # PAIRS
@@ -78,6 +79,58 @@ def train_samples() -> List[InputExample]:
         return train_samples
 
 
+@dataclass
+class LFQAID:
+    dataset = load_dataset("indonesian-nlp/lfqa_id", split="train", trust_remote_code=True)
+
+    @staticmethod
+    def train_samples() -> List[InputExample]:
+        train_samples = []
+
+        for datum in LFQAID.dataset:
+            question = datum["title"]
+            scores = datum["answers"]["score"]
+            answer = datum["answers"]["text"][np.argmax(scores)]
+
+            train_samples.append(InputExample(texts=[question, answer]))
+
+        return train_samples
+
+
+@dataclass
+class IndoQA:
+    dataset = load_dataset("jakartaresearch/indoqa", split="train", trust_remote_code=True)
+
+    @staticmethod
+    def train_samples() -> List[InputExample]:
+        train_samples = []
+
+        for datum in IndoQA.dataset:
+            question = datum["question"]
+            passage = datum["context"]
+            answer = datum["answer"]
+
+            if question and passage and answer:
+                train_samples.append(InputExample(texts=[question, passage]))
+                train_samples.append(InputExample(texts=[question, answer]))
+
+        return train_samples
+
+
+@dataclass
+class ParaphraseDetection:
+    dataset = load_dataset("jakartaresearch/id-paraphrase-detection", split="train", trust_remote_code=True)
+
+    @staticmethod
+    def train_samples() -> List[InputExample]:
+        train_samples = []
+
+        for datum in ParaphraseDetection.dataset:
+            train_samples.append(InputExample(texts=[datum["sentence1"], datum["sentence2"]]))
+
+        return train_samples
+
+
 ##############
 # TRIPLETS
 ##############
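The pair-building logic in the new `LFQAID` and `IndoQA` loaders can be sketched without downloading any data. The version below is an illustration under stated assumptions, not the repository's code: `best_answer` and `qa_pairs` are hypothetical helper names, the sample record is invented, and plain tuples stand in for `InputExample`. It uses a standard-library equivalent of `np.argmax` and adds a guard for an empty answer list, since `np.argmax` raises `ValueError` on an empty sequence. Whether the real dataset contains such records is not something this diff shows.

```python
from typing import Dict, List, Optional, Tuple


def best_answer(answers: Dict[str, list]) -> Optional[str]:
    """Return the text of the highest-scored answer, or None if there are none."""
    scores = answers["score"]
    texts = answers["text"]
    if not scores:
        return None
    # Index of the first maximum score; ties resolve to the earliest index,
    # which is also how np.argmax behaves.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return texts[best_index]


def qa_pairs(question: str, passage: str, answer: str) -> List[Tuple[str, str]]:
    """Mirror the IndoQA loader: emit two pairs, but only for complete records."""
    if question and passage and answer:
        return [(question, passage), (question, answer)]
    return []


# Invented record shaped like the fields the LFQAID loader reads.
record = {
    "title": "Mengapa langit berwarna biru?",
    "answers": {
        "text": ["Karena laut.", "Karena hamburan Rayleigh."],
        "score": [2, 17],
    },
}

pair = (record["title"], best_answer(record["answers"]))
print(pair)
```

Keeping only the top-scored answer yields one `(question, answer)` pair per question, while each complete IndoQA record contributes two pairs.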

training/all/train_all_mnrl.py

Lines changed: 16 additions & 1 deletion
@@ -6,7 +6,19 @@
 from sentence_transformers import SentenceTransformer, InputExample, models, losses
 from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
 
-from all_datasets import IndoNLI, IndoStoryCloze, mMARCO, MIRACL, WReTE, IndoLEMNTP, TyDiQA, FacQA
+from all_datasets import (
+    IndoNLI,
+    IndoStoryCloze,
+    mMARCO,
+    MIRACL,
+    WReTE,
+    IndoLEMNTP,
+    TyDiQA,
+    FacQA,
+    LFQAID,
+    IndoQA,
+    ParaphraseDetection,
+)
 from MultiDatasetDataLoader import MultiDatasetDataLoader
 
 
@@ -47,6 +59,9 @@ def main(args: Args):
         "SEACrowd/indolem_ntp": IndoLEMNTP,
         "khalidalt/tydiqa-goldp": TyDiQA,
         "SEACrowd/facqa": FacQA,
+        "indonesian-nlp/lfqa_id": LFQAID,
+        "jakartaresearch/indoqa": IndoQA,
+        "jakartaresearch/id-paraphrase-detection": ParaphraseDetection,
     }
 
     train_ds = [ds.train_samples() for ds in raw_datasets.values()]
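The registry pattern in this hunk, a dictionary that maps a dataset name to a class exposing a static `train_samples()`, can be sketched with stand-in classes. Everything named `Toy...` and the `"toy/..."` keys below are hypothetical placeholders, and plain tuples stand in for `InputExample`. Only the `raw_datasets` and `train_ds` lines follow the shape shown in the diff.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


class ToyQA:
    """Hypothetical stand-in for a loader class such as IndoQA."""

    @staticmethod
    def train_samples() -> List[Pair]:
        return [("question 1", "answer 1"), ("question 2", "answer 2")]


class ToyParaphrase:
    """Hypothetical stand-in for a loader class such as ParaphraseDetection."""

    @staticmethod
    def train_samples() -> List[Pair]:
        return [("sentence", "rephrased sentence")]


# Adding a dataset to training only needs one new entry in this mapping.
raw_datasets: Dict[str, type] = {
    "toy/qa": ToyQA,
    "toy/paraphrase": ToyParaphrase,
}

# Same shape as the line in the diff: one list of samples per dataset.
train_ds = [ds.train_samples() for ds in raw_datasets.values()]

# Per-dataset sizes, relying on dicts preserving insertion order.
sizes = {name: len(samples) for name, samples in zip(raw_datasets, train_ds)}
print(sizes)
```

Because every loader exposes the same static method, the three registry entries added by this commit are the only change `main` needs to include the new datasets in `train_ds`.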
