Summer Research Machine Learning Project

Official Website : https://math.nickvlamis.com/sro/sro-ml

Paper-style Report and Poster : https://www.tiowu.com/arxiv-papers-classification

Research Question : Can supervised neural networks that use modern embedding techniques (e.g. doc2vec) outperform arXiv’s API at auto-classifying preprints?

Research Objective : Design and implement neural-network-based machine-learning algorithms to auto-classify arXiv preprints, then compare their performance against the benchmarks provided by arXiv’s in-house ML team, with the aim of outperforming them.
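
As a rough illustration of the comparison described in the objective, the sketch below computes top-1 accuracy for two sets of predictions against the same ground-truth labels. The function name `top1_accuracy` and all labels are hypothetical toy values chosen for illustration; they are not results from this project or from arXiv’s benchmarks, and top-1 accuracy is only one of several metrics such a comparison could use.

```python
from typing import Sequence


def top1_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty label set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical toy data (assumption): true primary categories for four preprints.
actual = ["math.GT", "math.AG", "cs.LG", "math.GT"]

# Hypothetical predictions from a baseline (the control) and from a new model.
baseline_pred = ["math.GT", "math.GT", "cs.LG", "math.AT"]
model_pred = ["math.GT", "math.AG", "cs.LG", "math.AT"]

baseline_acc = top1_accuracy(baseline_pred, actual)
model_acc = top1_accuracy(model_pred, actual)

print(f"baseline: {baseline_acc:.2f}, model: {model_acc:.2f}")
```

Scoring both systems on the same held-out preprints keeps the comparison fair, since any difference in accuracy then comes from the classifiers rather than from the data.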

Project Summary : The arXiv is a preprint repository for research articles in several fields, including mathematics. When a preprint is submitted, the author must choose a category and subcategory, e.g. math.GT, where math is the category and GT (geometric topology) is the subcategory. The project is to design a machine-learning algorithm, built on neural networks and modern embedding techniques, that takes a preprint as input and outputs its primary category, thereby auto-classifying preprints. The arXiv recently released an API for auto-classification that uses older, well-established techniques. One potential goal of the project is to use this API as the control and test whether more advanced techniques can improve on its performance. In the process, we will explore several attainable designs.
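
The category/subcategory convention above (e.g. math.GT) can be sketched as a small parsing helper. This is a minimal sketch, not code from this repository: the names `ArxivLabel` and `parse_label` are hypothetical, and the helper assumes a label is a category optionally followed by a dot and a subcategory. Some arXiv categories, such as hep-th, have no subcategory.

```python
from typing import NamedTuple, Optional


class ArxivLabel(NamedTuple):
    """A category label split into its parts (hypothetical helper type)."""
    category: str
    subcategory: Optional[str]


def parse_label(label: str) -> ArxivLabel:
    """Split a label such as 'math.GT' into category and subcategory."""
    # partition splits on the first dot only; sep is '' when no dot is present.
    category, _sep, subcategory = label.strip().partition(".")
    if not category:
        raise ValueError(f"not a valid category label: {label!r}")
    # An absent or empty subcategory is reported as None.
    return ArxivLabel(category, subcategory or None)


label = parse_label("math.GT")
print(label.category, label.subcategory)
```

Splitting labels this way makes it easy to evaluate a classifier at two levels of granularity: the top-level category alone, or the full category and subcategory pair.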

Learning Objectives : Students will learn the basic theory of neural networks, best practices in machine learning, vector embedding techniques, how to engineer various models, how to work with industry-standard programming tools (including Python, TensorFlow, PyTorch, VS Code, and SSH), how to work with and manage big data, how to design and manage a large-scale project and its associated workflow, how to design experiments and test hypotheses, and how to work both in a team and independently.

Team Members

- Dr. Nicholas Vlamis, Faculty Mentor
- Kathy He
- FangFang (Daisy) Lyu
- Tao Wu

ArXiv Bulk Data: Kaggle.com via Google Cloud Storage buckets

ArXiv Hugging Face Datasets (minor and major): Huggingface.co
