Summer Research Machine Learning Project

Official Website : https://math.nickvlamis.com/sro/sro-ml

Paper-style Report and Poster : https://www.tiowu.com/arxiv-papers-classification

Research Question : Can supervised neural networks that use advanced embedding techniques (e.g. doc2vec) outperform arXiv’s API at auto-classification?

Research Objective : Design and implement machine-learning algorithms using neural networks to auto-classify arXiv preprints and compare performance against, and potentially outperform, benchmarks provided by arXiv’s in-house ML team.

Project Summary: The arXiv is a preprint repository for research articles in several fields, including mathematics. When a preprint is submitted, the author must choose a category and subcategory, e.g. math.GT, where math is the category and GT (geometric topology) is the subcategory. The project is to design a machine learning algorithm, based on neural networks and modern embedding techniques, that takes a preprint as input and outputs its primary category, thereby auto-classifying preprints. The arXiv recently released an API for auto-classification that uses older, well-established techniques. One potential goal of the project is to use this API as the control and to see whether more advanced techniques can improve performance. In the process, we will explore several attainable designs.
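As a minimal sketch of the labelling scheme and the control comparison described above, the snippet below splits a label such as math.GT into its category and subcategory and scores two sets of predictions against author-chosen labels. The names `ArxivCategory`, `parse_category` and `accuracy` are hypothetical helpers, not part of the project code or of arXiv's API, and the labels are toy values rather than project results. Plain accuracy is assumed here as the metric; the project may use a different one.

```python
from typing import NamedTuple, Optional, Sequence


class ArxivCategory(NamedTuple):
    category: str               # e.g. "math"
    subcategory: Optional[str]  # e.g. "GT"; None for labels such as "hep-th"


def parse_category(label: str) -> ArxivCategory:
    """Split a label like 'math.GT' into its category and subcategory."""
    category, _, subcategory = label.strip().partition(".")
    if not category:
        raise ValueError(f"empty category in label: {label!r}")
    return ArxivCategory(category, subcategory or None)


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of preprints whose predicted label matches the author's label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy labels for illustration only (not project results).
author_labels = ["math.GT", "math.AG", "cs.LG", "hep-th"]
control_preds = ["math.GT", "math.DG", "cs.LG", "hep-th"]  # stand-in for the API control
model_preds = ["math.GT", "math.AG", "cs.LG", "hep-ph"]    # stand-in for a new model

print(parse_category("math.GT"))
print(accuracy(control_preds, author_labels), accuracy(model_preds, author_labels))
```

Scoring the control and the candidate model with the same function on the same labels keeps the comparison like for like.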

Learning Objectives : Students will learn the basic theory of neural networks, best practices in machine learning, vector embedding techniques, how to engineer various models, how to work with industry-standard programming tools (including Python, TensorFlow, PyTorch, VS Code, and SSH), how to work with and manage big data, how to design and manage a large-scale project and its associated workflow, how to design experiments and test hypotheses, and how to work both in a team and independently.

Team Members

- Dr. Nicholas Vlamis, Faculty Mentor
- Kathy He
- FangFang (Daisy) Lyu
- Tao Wu

arXiv Bulk Data: Kaggle.com via Google Cloud Storage buckets

arXiv Hugging Face minor & major Datasets: Huggingface.co