Commit 4584ea2

Merge pull request #101 from databio/dev_search_doc
2 parents c5bfe2e + bf82892 commit 4584ea2

1 file changed: +72 -0 lines changed

docs/geniml/tutorials/text2bednn-search-interface.md

Lines changed: 72 additions & 0 deletions
@@ -105,5 +105,77 @@ query_dict = {
MAP, AUC, RP = search_interface.eval(query_dict)
```

## Hugging Face

### Model

`Vec2VecFNN` can be initialized from a Hugging Face repository:

```python
model = Vec2VecFNN("databio/v2v-sentencetransformers-encode")
```

To upload the model to Hugging Face, use the `export` function to write the model file (`checkpoint.pt`) and the config file (`config.yaml`) to a local folder:

```
v2v_torch1.export("path/to/target/folder", "checkpoint.pt")
```

Then upload both files, with the correct names, to the Hugging Face repository.
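
One way to carry out the upload step is with the `huggingface_hub` client library. The sketch below is an assumption about the workflow, not something geniml prescribes: the helper names and the repository ID `your-username/your-v2v-model` are hypothetical, and it assumes that `export` has already written `checkpoint.pt` and `config.yaml` into the folder and that you are authenticated with Hugging Face.

```python
from pathlib import Path

# File names that the exported model folder is expected to contain.
REQUIRED_FILES = ("checkpoint.pt", "config.yaml")


def find_missing_files(folder):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


def upload_exported_model(folder, repo_id):
    """Upload the exported checkpoint and config to a Hugging Face model repo.

    `repo_id` is a hypothetical placeholder, e.g. "your-username/your-v2v-model".
    """
    missing = find_missing_files(folder)
    if missing:
        raise FileNotFoundError(f"missing exported files: {missing}")

    # Imported lazily so the file check above works without huggingface_hub.
    from huggingface_hub import HfApi

    api = HfApi()
    # Create the repository if it does not exist yet.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    for name in REQUIRED_FILES:
        api.upload_file(
            path_or_fileobj=str(Path(folder) / name),
            path_in_repo=name,  # keep the file names unchanged in the repo
            repo_id=repo_id,
            repo_type="model",
        )
```

Calling `upload_exported_model` with the export folder and your own repository ID uploads both files. Checking for missing files first means a forgotten `config.yaml` fails locally instead of producing an incomplete repository.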

### Dataset

`geniml.search.anecdotal_search_from_hf_data` lets users search a Hugging Face dataset with free-form natural language queries. The dataset must have:

* hnsw index file of BED file embeddings (`index.bin`)
* dictionary file of payloads (`payloads.pkl`). Each payload must store the file name under the key `"file"`. For example:

```
# the key of each payload is its storage index in the hnsw index
{
    0: {
        "file": "Example.bed",
        ...
    },
    ...
}
```

* metadata file (`metadata.json`) in this format: `{<metadata attribute>: {<annotation text>: [<files>]}}`. For example:

```
{
    "tissue": {
        "kidney": [
            "Example.bed",
            ...
        ],
        ...
    },
    ...
}
```
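
The payload and metadata files described above can be produced with the Python standard library alone. The following is a minimal sketch, not geniml's own tooling: the helper names and the sample records are made up for illustration, building the hnsw index (`index.bin`) is left out, and it assumes the payload keys follow the order in which the embeddings were added to that index.

```python
import json
import pickle
from pathlib import Path


def build_metadata(annotations):
    """Group file names as {attribute: {annotation text: [files]}}.

    `annotations` is a hypothetical input: (file, attribute, text) tuples.
    """
    metadata = {}
    for file_name, attribute, text in annotations:
        files = metadata.setdefault(attribute, {}).setdefault(text, [])
        if file_name not in files:
            files.append(file_name)
    return metadata


def build_payloads(file_names):
    """Map each storage index to a payload holding the name under "file".

    Assumption: `file_names` is in the same order as the index entries.
    """
    return {idx: {"file": name} for idx, name in enumerate(file_names)}


def files_without_payload(payloads, metadata):
    """Return files named in the metadata that no payload refers to."""
    known = {payload["file"] for payload in payloads.values()}
    return sorted(
        name
        for texts in metadata.values()
        for files in texts.values()
        for name in files
        if name not in known
    )


def write_dataset_files(folder, payloads, metadata):
    """Write payloads.pkl and metadata.json into `folder`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / "payloads.pkl", "wb") as handle:
        pickle.dump(payloads, handle)
    with open(folder / "metadata.json", "w") as handle:
        json.dump(metadata, handle, indent=4)


# Made-up sample records, for illustration only.
sample = [("Example.bed", "tissue", "kidney"), ("Other.bed", "tissue", "liver")]
payloads = build_payloads(["Example.bed", "Other.bed"])
metadata = build_metadata(sample)
```

Running `files_without_payload(payloads, metadata)` before writing the files is a cheap consistency check: an empty list means every file mentioned in the metadata also has a payload entry.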

With the repo name of the dataset, the repo name of the `Vec2VecFNN` model, and the repo name of the model that was used to encode the training metadata, you can search the dataset with any free-form query:

```python
from geniml.search import anecdotal_search_from_hf_data
import pprint

# vec2vec model
search_repo = "databio/v2v-sentencetransformers-encode"
# text encoder model
text_repo = "sentence-transformers/all-MiniLM-L6-v2"
# dataset
data_repo = "databio/geo-hg38-search-test"
result = anecdotal_search_from_hf_data(
    "glioblastoma",
    data_repo,
    search_repo,
    text_repo,
    10
)

pprint.pprint(result)
```
