Miniproject: Machine Learning

Supervised and Unsupervised Text Classification.

1. We created sections using `ami3`. A typical section looks like this:

https://github.com/petermr/openDiagram/blob/master/physchem/resources/oil26/PMC5454990/sections/2_back/0_ack.xml

<?xml version="1.0" encoding="UTF-8"?>
<ack>
 <title>Acknowledgments</title>
 <p>The authors are grateful to CNPq-Programa “Ciências sem fronteiras” (Grant No. 233761/2014-4) for financial support.</p>
</ack>

2. We flattened the `XML` into plain text for readability:

https://github.com/petermr/openDiagram/blob/master/physchem/resources/oil26/PMC5454990/sections/2_back/0_ack.txt
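
The flattening step can be sketched with Python's standard library. This is a minimal illustration, not the actual `ami3` implementation: the function name `flatten_section` and the one-line-per-child-element layout are assumptions, and the real `0_ack.txt` may be laid out differently.

```python
import xml.etree.ElementTree as ET


def flatten_section(xml_text: str) -> str:
    """Return the text of an XML section, one line per child element.

    Hypothetical helper for illustration; ami3 may flatten differently.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        # itertext() also collects text inside nested inline markup;
        # split/join collapses runs of whitespace into single spaces.
        text = " ".join("".join(child.itertext()).split())
        if text:
            lines.append(text)
    return "\n".join(lines)


# The acknowledgments section shown in step 1.
ACK_XML = """<?xml version="1.0" encoding="UTF-8"?>
<ack>
 <title>Acknowledgments</title>
 <p>The authors are grateful to CNPq-Programa “Ciências sem fronteiras” (Grant No. 233761/2014-4) for financial support.</p>
</ack>"""

if __name__ == "__main__":
    print(flatten_section(ACK_XML))
```

Running the script prints the section title on the first line and the paragraph text on the second.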

3. We have two major problem statements:

Not all sections are labelled with a universally accepted vocabulary.
We want to improve our knowledge resource by clustering similar articles together on a paragraph or section basis. For example, if unsupervised learning shows that "gas chromatography" is a frequently used phrase, we use it as a label to group together other articles that mention gas chromatography.
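
The grouping idea in the second statement can be sketched as follows. This is a simplified stand-in, assuming a plain two-word phrase count in place of a real unsupervised model. The function names, the `min_docs` threshold, and the three toy sentences are all invented for illustration and are not taken from the corpus.

```python
import re
from collections import defaultdict


def bigrams(text):
    """Return the two-word phrases in a text, lowercased."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


def group_by_frequent_phrase(docs, min_docs=2):
    """Map each phrase found in at least `min_docs` documents
    to the sorted ids of the documents that mention it."""
    groups = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in set(bigrams(text)):
            groups[phrase].add(doc_id)
    return {p: sorted(ids) for p, ids in groups.items() if len(ids) >= min_docs}


# Toy sentences invented for this sketch.
docs = {
    "a": "Oils were analysed by gas chromatography.",
    "b": "Gas chromatography separated the terpenes.",
    "c": "Antibacterial activity was measured.",
}

if __name__ == "__main__":
    print(group_by_frequent_phrase(docs))
```

With these toy sentences, the only phrase shared by two documents is "gas chromatography", so it becomes the label for documents `a` and `b`.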

4. Goals

We plan to extract keywords and phrases using NLTK RAKE, then build bag-of-words and tf-idf representations of the data (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). We will manually agree on the labels we want to use for topic modelling.
We want to work with different Python tools and libraries and discover which ones serve our purpose best.
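
To make the two representations concrete, here is a minimal pure-Python sketch of bag of words and tf-idf. It assumes whitespace tokenisation and the textbook weighting (relative term frequency times the natural log of N over document frequency). Library implementations such as scikit-learn's use a smoothed and normalised variant, so their numbers will differ. The function names and sample strings are illustrative only.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Count how often each whitespace-separated token occurs per document."""
    return [Counter(doc.lower().split()) for doc in docs]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (count / tokens in document) * ln(N / documents containing term)
    """
    bags = bag_of_words(docs)
    n_docs = len(bags)
    # Each bag contributes a term once, so this is the document frequency.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in bag.items()}
        )
    return weights


# Sample strings invented for this sketch.
samples = ["gas chromatography of oil", "oil extraction", "gas chromatography"]

if __name__ == "__main__":
    for doc, w in zip(samples, tf_idf(samples)):
        print(doc, w)
```

A term that appears in every document gets a weight of zero, which is why tf-idf downplays common words and highlights terms that distinguish one section from another.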

5. Tools:

Scikit-learn clustering models
gensim
`CountVectorizer`
tf-idf
LDA
cosine similarity
spaCy
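
Of the tools listed, cosine similarity is the simplest to illustrate without any library. The sketch below assumes documents are already represented as sparse term-to-weight dictionaries (for example word counts or tf-idf weights). The function name and sample strings are illustrative, not part of the project code.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Sample strings invented for this sketch, turned into word counts.
doc1 = Counter("gas chromatography of essential oil".split())
doc2 = Counter("gas chromatography results".split())
doc3 = Counter("financial support acknowledged".split())

if __name__ == "__main__":
    print(cosine_similarity(doc1, doc2))
    print(cosine_similarity(doc1, doc3))
```

Documents that share terms score above zero, while documents with no terms in common score exactly zero, which makes the measure a natural fit for grouping sections by shared vocabulary.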
