
Commit 33db56e

ayushdg and vinay-raman authored

fixed github issue 677 (#682) (#687)

* fixed github issue 677
* added docs for hard-negative-mining
* minor changes
* fixed multi-gpu error

Signed-off-by: viraman <viraman@nvidia.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: vinay-raman <98057837+vinay-raman@users.noreply.github.com>

1 parent 1acb412 commit 33db56e

File tree

3 files changed: +26 −4 lines changed

tutorials/nemo-retriever-synthetic-data-generation/README.md

Lines changed: 18 additions & 0 deletions

```diff
@@ -109,3 +109,21 @@
 For Answerability Filter, our recommendation is to go with the choice provided in the default configuration file. We confirmed that the checkbox-style prompt in the default configuration worked well for valid question filtering.
 
 However, the framework is flexible in the choice of LLM-as-a-Judge, and different LLMs with different prompt templates might work better for certain use cases. You can also experiment with Likert-scale prompting if need be.
+
+## Hard Negative Mining
+
+Hard-negative mining involves two steps. The first step repartitions the dataset into clusters of semantically similar documents, using the following script:
+
+```
+python tutorials/nemo-retriever-synthetic-data-generation/repartition.py \
+  --api-key=<API Key> \
+  --input-dir=tutorials/nemo-retriever-synthetic-data-generation/sample_data/hard-neg-mining \
+  --hard-negative-mining-config=tutorials/nemo-retriever-synthetic-data-generation/config/hard-negative-mining-config.yaml \
+  --output-dir=tutorials/nemo-retriever-synthetic-data-generation/my_clustered_dataset_dir
+```
+
+Once the semantic clusters have been created, you can perform the hard-negative mining as follows:
+
+```
+python tutorials/nemo-retriever-synthetic-data-generation/mine_hard_negatives.py \
+  --api-key=<API Key> \
+  --input-dir=tutorials/nemo-retriever-synthetic-data-generation/my_clustered_dataset_dir \
+  --hard-negative-mining-config=tutorials/nemo-retriever-synthetic-data-generation/config/hard-negative-mining-config.yaml \
+  --output-dir=tutorials/nemo-retriever-synthetic-data-generation/my_mined_dataset_dir
+```
```
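For intuition about the mining step itself: a hard negative is a document that is highly similar to the query yet is not the labeled positive, which makes it a maximally informative training example for a retriever. A minimal, dependency-free sketch of that ranking idea (hypothetical helper names; not the tutorial's actual implementation, which embeds text with a model before scoring):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_hard_negatives(query_vec, doc_vecs, positive_id, k=2):
    # Rank every non-positive document by similarity to the query; the most
    # similar irrelevant documents are the "hardest" negatives.
    scored = [
        (doc_id, cosine(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
        if doc_id != positive_id
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```

Repartitioning into semantic clusters first (the tutorial's step one) keeps this search local: negatives are mined from documents already known to be near the query's topic.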

tutorials/nemo-retriever-synthetic-data-generation/mine_hard_negatives.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -36,7 +36,7 @@ def main():
         "--input-dir",
         type=str,
         default="",
-        help="Input dir path containing annotated data files in jsonl format",
+        help="Input dir path containing annotated data files in jsonl format (with extension .part)",
     )
     parser.add_argument(
         "--hard-negative-mining-config",
```
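The diff only touches one help string, but it implies the script's CLI surface. A hedged sketch of that argparse setup, with flag names taken from the diff and the README commands; the defaults and description here are illustrative assumptions, not the repository's exact code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the tutorial's mine_hard_negatives.py invocation.
    parser = argparse.ArgumentParser(description="Mine hard negatives")
    parser.add_argument("--api-key", type=str, default="")
    parser.add_argument(
        "--input-dir",
        type=str,
        default="",
        help="Input dir path containing annotated data files in jsonl format (with extension .part)",
    )
    parser.add_argument("--hard-negative-mining-config", type=str, default="")
    parser.add_argument("--output-dir", type=str, default="")
    return parser
```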

tutorials/nemo-retriever-synthetic-data-generation/retriever_hardnegative_miner.py

Lines changed: 7 additions & 3 deletions

```diff
@@ -14,12 +14,12 @@
 
 import importlib
 import itertools
+from typing import TYPE_CHECKING
 
 import numpy as np
 import pandas as pd
 from dask.base import normalize_token
 from openai import OpenAI
-from sentence_transformers import SentenceTransformer
 
 from nemo_curator import ClusteringModel
 from nemo_curator.datasets import DocumentDataset
@@ -30,13 +30,18 @@
 )
 RetrieverHardNegativeMiningConfig = config.RetrieverHardNegativeMiningConfig
 
+if TYPE_CHECKING:
+    from sentence_transformers import SentenceTransformer
+
 
 def create_nim_client(base_url, api_key):
     openai_client = OpenAI(base_url=base_url, api_key=api_key)
     return openai_client
 
 
-def create_hf_model(model_name_or_path):
+def create_hf_model(model_name_or_path: str) -> "SentenceTransformer":
+    from sentence_transformers import SentenceTransformer
+
     return SentenceTransformer(model_name_or_path, trust_remote_code=True)
```

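The change above is the standard lazy-import pattern: the heavy `sentence_transformers` import moves out of module scope, into a `TYPE_CHECKING` block for the type annotation and into the function body for the runtime use. The same pattern in general form, with `decimal.Decimal` standing in for the heavy dependency purely for illustration:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so a heavy
    # dependency named here adds nothing to module import time.
    from decimal import Decimal  # stand-in for e.g. sentence_transformers

def make_value(text: str) -> "Decimal":
    # The runtime import is deferred to the first call of this function,
    # so merely importing this module stays cheap.
    from decimal import Decimal
    return Decimal(text)
```

The string annotation `"Decimal"` (like `"SentenceTransformer"` in the diff) is a forward reference, so the name need not exist at runtime.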
```diff
@@ -167,7 +172,6 @@ def _groupby_question(self, pdf):
     def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
 
         df = dataset.df
-        df = df.to_backend("pandas")
         df = df[["question", "documents"]]
         df = df.map_partitions(self._groupby_question).reset_index()
         print("Number partitions in dataset = {}".format(df.npartitions))
```
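Removing `df.to_backend("pandas")` lets `map_partitions` run `_groupby_question` on whatever backend each partition already uses, which is what the multi-GPU fix in the commit message refers to. The body of `_groupby_question` is not shown in this diff; a hedged, pandas-only sketch of what such a per-partition group-by over `question` and `documents` might look like (an assumption for illustration, not the repository's actual implementation):

```python
import pandas as pd

def groupby_question(pdf: pd.DataFrame) -> pd.DataFrame:
    # Collapse one row per (question, document) pair into one row per
    # question, gathering that question's documents into a list -- the
    # per-partition shape a map_partitions call could then consume.
    return pdf.groupby("question").agg({"documents": list})
```

Because the dataset was repartitioned by semantic cluster in step one, each partition holds related rows, so a purely partition-local group-by like this can be meaningful without a global shuffle.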
