Integrated oversampling
Sequence classification, also known as time series classification, is the classification of an ordered sequence of feature observations. Unlike standard classification, the ordering of the observations matters. Such sequences are common in applications such as natural language processing, gene sequencing and algorithmic finance. When the labels are imbalanced, the conventional approach in non-sequence classification problems is to undersample the majority class or oversample the minority class. Techniques such as SMOTE go further and generate synthetic minority samples by interpolating between neighbouring minority observations, and related variants concentrate on samples close to the decision boundary. However, these techniques are not well suited to time series classification, because they do not account for the correlation between observations within a sequence. The purpose of this project is to implement an R package for oversampling the minority class while preserving the covariance structure of the panel data.
Hong Cao, Xiao-Li Li, David Yew-Kwong Woon, See-Kiong Ng, "Integrated Oversampling for Imbalanced Time Series Classification", IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 12, Dec. 2013.
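To make the idea of oversampling the minority class while preserving its covariance structure concrete, here is a minimal sketch in base R. It is not the method of Cao et al.; it is a simplified stand-in that fits a multivariate Gaussian to the minority sequences and draws new sequences from it. The function name `oversample_gaussian`, its arguments, and the default `ridge` value are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: draw synthetic minority sequences from a Gaussian
# fitted to the observed minority sequences. A simplified illustration,
# NOT the integrated oversampling method of Cao et al. (2013).
oversample_gaussian <- function(X_min, n_new, ridge = 1e-6) {
  # X_min: numeric matrix, one minority sequence per row,
  #        one time step per column.
  stopifnot(is.matrix(X_min), nrow(X_min) >= 2, n_new >= 1)
  p  <- ncol(X_min)
  mu <- colMeans(X_min)
  # Sample covariance across time steps. The small ridge (an assumed value)
  # keeps the matrix positive definite when there are fewer sequences than
  # time steps; it may need rescaling for data on a very different scale.
  S <- cov(X_min) + diag(ridge, p)
  R <- chol(S)  # upper triangular factor, S = t(R) %*% R
  Z <- matrix(rnorm(n_new * p), nrow = n_new, ncol = p)
  # Rows of Z %*% R have covariance S; add the mean back column by column.
  sweep(Z %*% R, 2, mu, "+")
}

set.seed(1)
# Toy minority class: 20 random-walk sequences of length 5.
X_min <- t(replicate(20, cumsum(rnorm(5))))
X_new <- oversample_gaussian(X_min, n_new = 100)
dim(X_new)  # 100 rows (synthetic sequences), 5 columns (time steps)
```

Unlike interpolating between pairs of neighbours, this draws each synthetic sequence from the estimated joint distribution of all time steps, so the correlation between time steps is carried over on average. A real package would need a more careful covariance estimate when minority samples are scarce.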
This project complements existing efforts and potential new projects to make TensorFlow available in R. TensorFlow provides an implementation of LSTMs, an RNN method well suited to time series classification, and an integrated oversampling package will support the application of LSTMs and other RNNs to real-world problems plagued by class imbalance.
Each project needs 2 mentors. One should be an expert R programmer with previous package development experience, and the other can be a domain expert in some other field or application area (optimization, bioinformatics, machine learning, data viz, etc). Ideally one of the two mentors should have previous experience with GSOC (either as a student or mentor).
List several tests that potential students can do to demonstrate their capabilities for this particular project. Ask some hard questions that will give you insight about how the students write code to solve problems. You’ll see that the harder the questions that you ask, the easier it will be for you to choose between the students that apply for your project! Please modify the suggestions below to make them specific for your project.
Easy: something that any useR should be able to do, e.g. download some existing package listed in the Related Work, and run it on some example data.
Medium: something a bit more complicated. You can encourage students to write a script or some functions that show their R coding abilities.
Hard: Can the student write a package with Rd files, tests, and vignettes? If your package interfaces with non-R code, can the student write in that other language?
Solutions of tests