Yifan Peng edited this page Nov 6, 2018 · 14 revisions

FAQs

Will you make source corpora available?

Unfortunately, we cannot redistribute the corpora because of copyright restrictions. The PubMed abstracts can be downloaded from https://www.ncbi.nlm.nih.gov/pubmed. The MIMIC-III Clinical Database can be downloaded from https://physionet.org/works/MIMICIIIClinicalDatabase/access.shtml.

How do I use the BioWordVec and BioSentVec models?

BioWordVec is distributed in the binary word2vec C format. One way to read the model is with gensim. The following example is copied from the gensim website; `filename` is the path to the downloaded model file:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(filename, binary=True)
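If gensim is not available, the file can in principle be read by hand. The sketch below is a minimal reader for the binary word2vec C layout as it is commonly written: an ASCII header line with the vocabulary size and dimension, then for each word its bytes, a space, and the vector as 32-bit floats. The function name `read_word2vec_binary`, the `limit` parameter, and the little-endian assumption are illustrative choices and not part of the BioWordVec release; a pure-Python loop like this will also be slow on a model of this size, so `limit` is there to peek at the first few entries.

```python
import struct


def read_word2vec_binary(path, limit=None):
    """Read up to `limit` vectors from a word2vec C binary file into a dict.

    Assumes little-endian float32 values, which is what the reference
    tool writes on common hardware.
    """
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <dim>\n" in ASCII.
        vocab_size, dim = (int(x) for x in f.readline().split())
        count = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(count):
            # The word is every byte up to the first space. Some writers
            # emit a newline after each vector, so newlines are skipped.
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch == b"":
                    raise EOFError("unexpected end of file while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":
                    chars.extend(ch)
            word = chars.decode("utf-8", errors="ignore")
            # The vector is `dim` consecutive 4-byte floats.
            vectors[word] = struct.unpack("<%df" % dim, f.read(4 * dim))
    return vectors
```

For example, `read_word2vec_binary(filename, limit=5)` would return the first five words and their vectors, which can be a quick way to confirm that the download is intact before loading the full model with gensim.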

BioSentVec is built upon sent2vec. To infer sentence embeddings, please see the "Directly from Python" section of the sent2vec documentation. The following example is copied from the sent2vec website:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .") 
embs = model.embed_sentences(["first sentence .", "another sentence"])
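A common next step with sentence embeddings is to compare two of them. The sketch below computes cosine similarity in plain Python so that it has no dependencies beyond the standard library. The helper name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here, not something prescribed by sent2vec.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention for this sketch: similarity with a zero vector is 0.
        return 0.0
    return dot / (norm_u * norm_v)
```

Assuming `embed_sentences` returns one row per input sentence, the two sentences in the example above could be compared with `cosine_similarity(embs[0], embs[1])`.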

Where can I find the code to preprocess the text?

The preprocessing methods can be found in the src folder. In general, the text was first tokenized using NLTK and then lowercased.
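The two steps described above can be sketched as follows. This is only an illustration and not the code from the src folder: the function name `preprocess_sentence` is hypothetical, `TreebankWordTokenizer` is one of several NLTK tokenizers and may not be the one used for the released models, and the regex fallback exists only so the snippet still runs where NLTK is not installed.

```python
import re

try:
    # Rule-based NLTK tokenizer; it needs no separately downloaded data.
    from nltk.tokenize import TreebankWordTokenizer

    _tokenize = TreebankWordTokenizer().tokenize
except ImportError:
    # Assumption for this sketch only: a crude regex split that separates
    # punctuation from words. It is not equivalent to NLTK's tokenizer.
    def _tokenize(text):
        return re.findall(r"\w+|[^\w\s]", text)


def preprocess_sentence(text):
    """Tokenize the text, then lowercase it and join tokens with spaces."""
    tokens = _tokenize(text)
    return " ".join(tokens).lower()
```

The resulting string is in the space-separated, lowercased form that the sent2vec example above passes to `embed_sentence`. For results that match the released models, the preprocessing code in the src folder should take precedence over this sketch.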

Where can I find the code to generate the models?

The bash scripts can be found in the src folder.

How do I cite BioSentVec?

@article{chen2018biosentvec,
  title={BioSentVec: creating sentence embeddings for biomedical texts},
  author={Chen, Qingyu and Peng, Yifan and Lu, Zhiyong},
  journal={arXiv preprint arXiv:1810.09302},
  year={2018}
}