This project was my senior design project for SE491 and SE492 at Iowa State.
We created a POS tagger specifically for software documentation by defining a new tag set, collecting new training data, and experimenting with various models, ultimately settling on a CRF, which gave the best results.
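For a rough sense of the final approach, here is a minimal sketch of CRF tagging with sklearn-crfsuite. The features, the example tag set (the `CODE` tag is made up here), and the training sentence are illustrative placeholders, not the project's actual configuration.

```python
# Minimal CRF tagging sketch (illustrative only; not the project's exact features or tags).
import sklearn_crfsuite

def token_features(sent, i):
    """Simple per-token features; the real feature set was tuned for software documentation."""
    word = sent[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        "is_capitalized": word[:1].isupper(),
        "is_mixed_case": word != word.lower() and word != word.upper(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i + 1 < len(sent) else "<EOS>",
    }

def sent_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]

# Hypothetical training data; the real data lives in the data/ and training/ directories.
train_sents = [["Returns", "the", "hashCode", "of", "this", "object", "."]]
train_tags = [["VBZ", "DT", "CODE", "IN", "DT", "NN", "."]]  # CODE is a placeholder domain tag

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit([sent_features(s) for s in train_sents], train_tags)
print(crf.predict([sent_features(["Returns", "the", "value", "."])]))
```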
Check out the project poster for the quickest way to learn about the project.
The project website is available here.
My role on the project was Computational Linguistics SME: I often led the design of our approach to problems such as the tag set, which model to use, and which parameters to use.
This was done in collaboration with:
- Ahmad Alramahi, [email protected]
- Austin Boling, [email protected]
- Joseph Naberhaus, [email protected]
- Ekene Okeke, [email protected]
- Ethan Ruchotzke, [email protected]
Autotagging contains the autotagger module used for automatically tagging JSON-formatted datafiles. The autotagger fills in any obvious missing tags (pure English words, numbers, etc.) and leaves the rest for manual tagging.
A small set of utilities used to check consistency between the Python and Java versions of Stanford NLP (Stanza vs. CoreNLP). These are not needed for the main project.
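As a rough illustration of the idea (the JSON layout and tag names below are assumptions, not the project's exact schema), the autotagger essentially does something like this:

```python
import json

def autotag_file(path):
    """Fill in only the unambiguous tags; leave everything else for the manual tagger."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for sentence in data["sentences"]:           # assumed structure
        for token in sentence["tokens"]:
            if token.get("tag"):                 # already tagged, skip
                continue
            if token["text"].isdigit():
                token["tag"] = "CD"              # numbers are unambiguous
            # pure-English words could similarly be pre-tagged with a stock English tagger;
            # anything ambiguous stays untagged for the manual tagger
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
```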
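A minimal sketch of that kind of check, assuming a local CoreNLP install (pointed to by CORENLP_HOME) and stanza's client API; these are not the project's actual scripts:

```python
import stanza
from stanza.server import CoreNLPClient

text = "Returns the hash code value for this object."

# Tag with stanza's neural pipeline.
nlp = stanza.Pipeline("en", processors="tokenize,pos")
stanza_tags = [(w.text, w.xpos) for s in nlp(text).sentences for w in s.words]

# Tag the same text with the Java CoreNLP server.
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos"], be_quiet=True) as client:
    ann = client.annotate(text)
    corenlp_tags = [(t.word, t.pos) for s in ann.sentence for t in s.token]

# Report any tokens where the two taggers disagree.
for (w1, t1), (w2, t2) in zip(stanza_tags, corenlp_tags):
    if t1 != t2:
        print(f"mismatch: {w1!r} stanza={t1} corenlp={t2}")
```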
The data directory contains all of the data used for training and our iterations of work. Here you can see examples of each type of data used in the first iteration of the project.
A small module used to convert JSON-formatted data into NLP-formatted data. It was scrapped in favor of the converter in NLPModel and is not recommended for use.
A small parser that attempted to parse Javadoc HTML directly. It is deprecated and was scrapped in favor of the universal HTML parser.
The manual tagger is a JavaFX GUI used to tag untagged data in a directory of JSON datafiles; it is the tool to use for any manual tagging. The patcher is also located here, and can apply a specified tag to every occurrence of a given token.
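The patcher itself is part of the JavaFX tool; purely as an illustration of the operation (with an assumed JSON layout), it amounts to something like:

```python
import json
import pathlib

def patch_token(data_dir, target_token, new_tag):
    """Apply new_tag to every occurrence of target_token across a directory of JSON datafiles."""
    for path in pathlib.Path(data_dir).glob("*.json"):
        data = json.loads(path.read_text(encoding="utf-8"))
        for sentence in data["sentences"]:       # assumed layout
            for token in sentence["tokens"]:
                if token["text"] == target_token:
                    token["tag"] = new_tag
        path.write_text(json.dumps(data, indent=2), encoding="utf-8")

# e.g. patch_token("data/tagged", "hashCode", "CODE")  # hypothetical directory and tag name
```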
The NLP model contains all tagging, training, and grading utilities for the pipeline. See the associated README for more information on how to use it.
A small temporary directory of miscellaneous files. The live-parser resides here, but is not especially useful (it was an offline version of the CoreNLP online parser).
The tokenization module for the pipeline. This is a Java module capable of tokenizing and sentence-splitting plaintext HTML input. It is one of the more important modules and is used within the universal HTML parser.
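Grading presumably boils down to comparing predicted tags against gold tags; a minimal per-token accuracy sketch (the tag names are placeholders):

```python
def token_accuracy(gold_sents, pred_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        for g, p in zip(gold, pred):
            total += 1
            correct += (g == p)
    return correct / total if total else 0.0

gold = [["VBZ", "DT", "CODE", "."]]
pred = [["VBZ", "DT", "NN", "."]]
print(f"token accuracy: {token_accuracy(gold, pred):.2%}")  # 75.00%
```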
The training directory contains all of the data used for training the model, as well as the current iteration of the model. This directory is referenced by the NLP model application.
The universal HTML parser is located here; it takes URL inputs and produces tokenized JSON data for training and tagging. See the associated README for more information.
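The parser itself is a separate module (see its README); as a rough Python sketch of the same pipeline (fetch a URL, strip the HTML, tokenize and sentence-split, emit JSON), with an assumed output schema:

```python
import json

import requests
import stanza
from bs4 import BeautifulSoup

def url_to_json(url, out_path):
    """Fetch a page, extract its text, tokenize/sentence-split it, and write untagged JSON."""
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    nlp = stanza.Pipeline("en", processors="tokenize")
    doc = nlp(text)
    data = {
        "source": url,
        "sentences": [
            {"tokens": [{"text": t.text, "tag": None} for t in s.tokens]}
            for s in doc.sentences
        ],
    }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)

# e.g. url_to_json("https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html", "object.json")
```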