
Miniproject: Machine Learning

Chaitanya Sharma edited this page Jun 18, 2021 · 21 revisions

Supervised and Unsupervised Text Classification.

1. We created sections using ami3. An example section looks like this:

https://github.com/petermr/openDiagram/blob/master/physchem/resources/oil26/PMC5454990/sections/2_back/0_ack.xml

<?xml version="1.0" encoding="UTF-8"?>
<ack>
 <title>Acknowledgments</title>
 <p>The authors are grateful to CNPq-Programa “Ciências sem fronteiras” (Grant No. 233761/2014-4) for financial support.</p>
</ack>

2. We flatten the XML into plain text for readability:

https://github.com/petermr/openDiagram/blob/master/physchem/resources/oil26/PMC5454990/sections/2_back/0_ack.txt
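
The flattening step can be sketched with Python's standard library. This is a minimal illustration, not the project's actual converter: the function name `flatten_xml` and the one-line-per-text-node output layout are assumptions, and the real `.txt` files may be formatted differently.

```python
import xml.etree.ElementTree as ET

# The acknowledgments section shown above, used as sample input.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<ack>
 <title>Acknowledgments</title>
 <p>The authors are grateful to CNPq-Programa “Ciências sem fronteiras” (Grant No. 233761/2014-4) for financial support.</p>
</ack>
"""


def flatten_xml(xml_text):
    """Return the text of an XML section, one line per non-empty text node.

    Hypothetical helper: tags and attributes are dropped, and runs of
    whitespace inside each text node are collapsed to single spaces.
    """
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    pieces = (" ".join(chunk.split()) for chunk in root.itertext())
    return "\n".join(piece for piece in pieces if piece)


if __name__ == "__main__":
    print(flatten_xml(SAMPLE_XML))
```

Mixed content (for example, italic tags inside a paragraph) would be split across lines by this sketch, so a production version would probably join text per paragraph instead.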

3. We have two major problem statements:

  • Not all sections are labelled with a universally accepted vocabulary.
  • We want to improve our knowledge resource by clustering similar articles at the paragraph or section level. For example, if unsupervised learning shows that "gas chromatography" is a frequently used phrase, we can use it as a label to group together other articles that mention gas chromatography.
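
The grouping idea in the second point can be sketched as follows. This is only an illustration under stated assumptions: the section identifiers and texts are made-up toy data, `group_by_phrase` is a hypothetical helper, and matching is a plain case-insensitive substring test rather than anything learned.

```python
from collections import defaultdict


def group_by_phrase(sections, phrases):
    """Map each label phrase to the ids of the sections that mention it.

    `sections` maps a section id to its flattened text; `phrases` is a
    list of candidate labels. Matching is case-insensitive.
    """
    groups = defaultdict(list)
    for section_id, text in sections.items():
        lowered = text.lower()
        for phrase in phrases:
            if phrase.lower() in lowered:
                groups[phrase].append(section_id)
    return dict(groups)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the corpus.
    toy_sections = {
        "doc_a/methods": "Samples were analysed by gas chromatography.",
        "doc_b/ack": "The authors thank the funding agency.",
        "doc_c/methods": "Gas chromatography was used to identify compounds.",
    }
    print(group_by_phrase(toy_sections, ["gas chromatography"]))
```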

4. Goals

  • We plan to extract keywords and phrases using RAKE (Rapid Automatic Keyword Extraction) with NLTK, then build bag-of-words and tf-idf representations of the data (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). We manually agree on the labels we want to use for topic modelling.
  • We want to try different Python tools and libraries and find out which ones serve our purpose best.
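
To make the bag-of-words and tf-idf representations concrete, here is a dependency-free sketch. It assumes a simple lowercase alphanumeric tokenizer and the textbook weighting tf × ln(N / df). The function names are hypothetical, and scikit-learn's `TfidfVectorizer` would give different numbers because by default it smooths the idf term and L2-normalizes each vector.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs):
    """Return one term-count Counter per document."""
    return [Counter(tokenize(doc)) for doc in docs]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (count / document length) * ln(N / document frequency),
    so a term that occurs in every document gets weight 0.
    """
    bags = bag_of_words(docs)
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # count each term once per document
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


if __name__ == "__main__":
    # Toy documents for illustration only.
    toy_docs = ["gas chromatography of oil", "oil extraction", "oil yield and gas"]
    for weights in tf_idf(toy_docs):
        print(sorted(weights.items(), key=lambda item: -item[1]))
```

In this weighting, rare terms rank above terms shared by most documents, which is why tf-idf is a reasonable starting point for choosing candidate labels.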

5. Tools:

  • scikit-learn clustering models
  • Gensim
  • CountVectorizer
  • tf-idf
  • LDA (latent Dirichlet allocation)
  • cosine similarity
  • spaCy
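
As an illustration of the cosine similarity entry in the list above, the following sketch compares two sparse term-weight vectors stored as dictionaries (for example tf-idf weights). The function name and the dictionary representation are assumptions made for this example. scikit-learn ships its own `sklearn.metrics.pairwise.cosine_similarity` for matrix input.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors.

    Returns 0.0 if either vector has zero length.
    """
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in vec_a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors for illustration only.
    section_1 = {"gas": 1.0, "chromatography": 2.0}
    section_2 = {"gas": 2.0, "chromatography": 4.0}
    section_3 = {"funding": 1.0}
    print(cosine_similarity(section_1, section_2))
    print(cosine_similarity(section_1, section_3))
```

Because the measure depends only on direction, two sections with the same term proportions score close to 1 regardless of length, and sections with no shared terms score 0.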
