Home
Qingyu Chen edited this page May 1, 2019
Unfortunately, we cannot provide the corpora due to copyright restrictions. The PubMed abstracts can be downloaded from https://www.ncbi.nlm.nih.gov/pubmed. The MIMIC-III Clinical Database can be downloaded from https://physionet.org/works/MIMICIIIClinicalDatabase/access.shtml.
BioWordVec is distributed in the binary word2vec C format. One way to read the model is with gensim. The following example is taken from the gensim website.
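To use the BioWordVec vectors:

```python
from gensim.models import KeyedVectors
# filename: path to the downloaded BioWordVec vector file (binary word2vec format)
model = KeyedVectors.load_word2vec_format(filename, binary=True)
```

To use the BioWordVec model:

```python
from gensim.models import FastText
# filename: path to the downloaded BioWordVec fastText model file.
# Newer gensim releases replace this call with gensim.models.fasttext.load_facebook_model.
model = FastText.load_fasttext_format(filename)
```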
BioSentVec is built upon [sent2vec](https://github.com/epfml/sent2vec). To infer sentence embeddings, see the `Directly from python` section of the sent2vec README. The following example is copied from their website:
```python
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")
embs = model.embed_sentences(["first sentence .", "another sentence"])
```
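
A common next step after inferring sentence embeddings is to compare two of them. The following is a minimal sketch, not part of the sent2vec or BioSentVec code: `cosine_similarity` is a hypothetical helper, and the toy vectors stand in for the arrays that `model.embed_sentence` is assumed to return (NumPy arrays, flattened here so that a shape of `(1, dim)` or `(dim,)` both work).

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (hypothetical helper)."""
    # Flatten so that inputs of shape (1, dim) or (dim,) are both accepted.
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # An all-zero embedding has no direction; report zero similarity.
        return 0.0
    return float(np.dot(a, b) / denom)


# Toy vectors used in place of real model output (assumption for illustration).
emb_a = np.array([[1.0, 2.0, 3.0]])
emb_b = np.array([[2.0, 4.0, 6.0]])
print(cosine_similarity(emb_a, emb_b))
```

With a loaded model, the same helper could be called as `cosine_similarity(model.embed_sentence(s1), model.embed_sentence(s2))`, where `s1` and `s2` are preprocessed sentences.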
The preprocessing methods can be found in the `src` folder. In general, the text was first tokenized using NLTK and then lowercased.
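
The tokenize-then-lowercase step described above can be sketched as follows. This is an illustration only, not the script from the `src` folder: `preprocess_sentence` is a hypothetical name, and the choice of NLTK's `wordpunct_tokenize` (which needs no extra data download) is an assumption; the repository's script may use a different NLTK tokenizer.

```python
import re

try:
    # Assumption: wordpunct_tokenize stands in for whichever NLTK tokenizer
    # the repository actually uses. It is regex-based and needs no data files.
    from nltk.tokenize import wordpunct_tokenize as tokenize
except ImportError:
    # Fallback applying the same splitting rule when NLTK is not installed.
    def tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def preprocess_sentence(text):
    """Tokenize first, then lowercase, and join tokens with single spaces."""
    tokens = tokenize(text)
    return " ".join(token.lower() for token in tokens)


print(preprocess_sentence("Breast cancers with HER2 amplification."))
# prints: breast cancers with her2 amplification .
```

Sentences should be passed through the same preprocessing before embedding, so that the input matches the form of the text the models were trained on.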
The bash scripts can be found in the `src` folder.
@article{chen2018biosentvec,
title={BioSentVec: creating sentence embeddings for biomedical texts},
author={Chen, Qingyu and Peng, Yifan and Lu, Zhiyong},
journal={arXiv preprint arXiv:1810.09302},
year={2018}
}