<head>
</head>
<h1>Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter - Medical lexicon Dataset</h1>
<h2>Amira Ghenai</h2>
<h3>University of Waterloo</h3>
<!--<h3><a href=mailto:aghenai@uwaterloo.ca>aghenai@uwaterloo.ca</a></h3> -->
This site provides the medical lexicon generated in the work presented in "Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter".
<p>
<b><i>See also:</i> The full paper is <a href=https://arxiv.org/abs/1707.03778>available here</a></b>.
<p>
We compute the medical lexicon of `infectious disease' Wikipedia pages using two different corpora:
<ul>
<li>Medical corpus (all Wikipedia pages about infectious diseases)</li>
<li>Wikipedia corpus (top words across all of Wikipedia)</li>
</ul>
For every word in each corpus, we compute probabilities as follows:
<ol>
<li>In the medical corpus, we compute the probability of each word w as:
mp_w (medical corpus probability of w) = frequency of w in the medical corpus / total number of words in the medical corpus.
</li>
<li>In the Wikipedia corpus, we compute the probability of each word w as:
wp_w = frequency of w in the Wikipedia corpus / total number of Wikipedia words.
</li>
<li>Then, for every word w:
<ul>
<li>If w appears in only one of the two corpora (medical but not Wikipedia, or the reverse), p_w = 0.</li>
<li>If w appears in both corpora, p_w = mp_w - wp_w (medical corpus probability minus Wikipedia corpus probability).</li>
</ul>
</li>
<li>Finally, we sort words by p_w in descending order and pick the top words to form the lexicon.</li>
</ol>
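The steps above can be sketched in Python. This is a minimal illustration with toy word counts; the <code>lexicon_scores</code> helper and the sample frequencies are hypothetical and not part of the released dataset:

```python
from collections import Counter

def lexicon_scores(medical_counts, wiki_counts):
    """Rank words by how much more probable they are in the
    medical corpus than in the general Wikipedia corpus."""
    med_total = sum(medical_counts.values())
    wiki_total = sum(wiki_counts.values())
    scores = {}
    for w in set(medical_counts) | set(wiki_counts):
        if w in medical_counts and w in wiki_counts:
            mp_w = medical_counts[w] / med_total   # medical corpus probability
            wp_w = wiki_counts[w] / wiki_total     # Wikipedia corpus probability
            scores[w] = mp_w - wp_w
        else:
            scores[w] = 0.0  # word appears in only one corpus
    # sort by p_w in descending order; the lexicon keeps the top words
    return sorted(scores, key=scores.get, reverse=True)

# Toy example (made-up counts)
medical = Counter({"fever": 30, "virus": 20, "the": 50})
wiki = Counter({"the": 90, "fever": 5, "music": 5})
top = lexicon_scores(medical, wiki)
```

In this toy example, "fever" ranks first (0.30 - 0.05 = 0.25), while a common function word like "the" sinks to the bottom because it is far more frequent in general Wikipedia text.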
<br>
The set of medical and wikipedia lexicons may be downloaded here:
<ul>
<li><a href="https://aghenai.github.io/assets/medical_corpus/medical_corpus.txt">medical_corpus.txt</a>
<li><a href="https://aghenai.github.io/assets/medical_corpus/wikipedia_corpus.txt">wikipedia_corpus.txt</a>
</ul>
<br>
Both the medical_corpus.txt and the wikipedia_corpus.txt files contain 22,123 words. Each word is on a separate line, in the format: WORD [TAB] FREQUENCY.
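A minimal Python sketch for reading this format (the <code>parse_lexicon</code> helper is hypothetical, and frequencies are assumed to be integers):

```python
def parse_lexicon(lines):
    """Parse lines of the form WORD<TAB>FREQUENCY into a dict."""
    lexicon = {}
    for line in lines:
        word, freq = line.rstrip("\n").split("\t")
        lexicon[word] = int(freq)  # assumption: frequencies are integer counts
    return lexicon

# With a downloaded file:
#   lexicon = parse_lexicon(open("medical_corpus.txt", encoding="utf-8"))
sample = ["fever\t30\n", "virus\t20\n"]
lex = parse_lexicon(sample)
```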
<br>
For more details, please read the paper.
<br>
<br>
Please cite the original publication when using the dataset:<br>
Amira Ghenai and Yelena Mejova. 2017. Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter. In Proceedings of the Fifth IEEE International Conference on Healthcare Informatics (ICHI 2017), Park City, Utah.