tiowu/ArXiv_Papers_Classification

Summer Research Machine Learning Project

Official Website : https://math.nickvlamis.com/sro/sro-ml

Paper-style Report and Poster : https://www.tiowu.com/arxiv-papers-classification

Research Question : Can supervised neural networks that use advanced embedding techniques (e.g. doc2vec) outperform arXiv’s API at auto-classification?

Research Objective : Design and implement neural-network-based machine learning algorithms that auto-classify arXiv preprints, then compare their performance against (and potentially exceed) the benchmarks provided by arXiv’s in-house ML team.

Project Summary: The arXiv is a preprint repository for research articles in several fields, including mathematics. When a preprint is submitted, the author must choose a category and subcategory, e.g. math.GT, where math is the category and GT (geometric topology) is the subcategory. The project is to design a machine learning algorithm, built on neural networks and modern embedding techniques, that takes a preprint as input and outputs its primary category, thereby auto-classifying preprints. The arXiv recently released an API for auto-classification that uses older, well-established techniques. One potential goal of the project is to use this API as the control and test whether more advanced techniques can improve performance. In the process, we will explore several attainable designs.
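As a rough illustration of the input-to-category pipeline described above, here is a minimal sketch in plain Python. It is not the project's method: a normalised bag-of-words vector stands in for the doc2vec embedding, and a nearest-centroid rule stands in for the neural network. The names `split_label`, `embed`, and `CentroidClassifier`, along with the four training snippets, are hypothetical and invented for this example.

```python
import math
from collections import Counter, defaultdict


def split_label(label):
    """Split an arXiv label such as 'math.GT' into (category, subcategory)."""
    category, _, subcategory = label.partition(".")
    return category, subcategory or None


def normalise(vec):
    """Scale a sparse vector (dict of word -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def embed(text):
    """Toy stand-in for an embedding: unit-length bag-of-words counts."""
    return normalise(Counter(text.lower().split()))


def dot(u, v):
    """Dot product of two sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


class CentroidClassifier:
    """Assigns a text to the label whose mean embedding is most similar."""

    def fit(self, texts, labels):
        sums = defaultdict(Counter)
        for text, label in zip(texts, labels):
            for word, weight in embed(text).items():
                sums[label][word] += weight
        # Unit-length centroids, so dot() below is a cosine similarity.
        self.centroids = {lab: normalise(vec) for lab, vec in sums.items()}
        return self

    def predict(self, text):
        vec = embed(text)
        return max(self.centroids, key=lambda lab: dot(vec, self.centroids[lab]))


# Invented training snippets (assumption: not taken from the arXiv data).
train = [
    ("hyperbolic surfaces and mapping class groups of 3-manifolds", "math.GT"),
    ("knot invariants and surfaces in 3-manifolds", "math.GT"),
    ("prime numbers and zeros of the zeta function", "math.NT"),
    ("elliptic curves over number fields and prime ideals", "math.NT"),
]
clf = CentroidClassifier().fit(*zip(*train))

for query in ("mapping class groups of infinite-type surfaces",
              "prime ideals in number fields"):
    label = clf.predict(query)
    print(query, "->", label, split_label(label))
```

Swapping `embed` for a learned embedding and `CentroidClassifier` for a trained network keeps the same interface: text in, primary category out.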

Learning Objectives : Students will learn the basic theory of neural networks, best practices in machine learning, and vector embedding techniques. They will also learn how to engineer various models; work with industry-standard programming tools (including Python, TensorFlow, PyTorch, VS Code, and SSH); work with and manage big data; design and manage a large-scale project and its workflow; design experiments and test hypotheses; and work both in a team and independently.

Team Members

- Dr. Nicholas Vlamis, Faculty Mentor
- Kathy He
- FangFang (Daisy) Lyu
- Tao Wu

ArXiv Bulk Data: Kaggle.com via Google Cloud Storage buckets

ArXiv HuggingFace minor & major Datasets: Huggingface.co

About

Summer research on NLP and LLMs, using neural networks to compete with arXiv's experimental classifiers
