This repo holds code for analyzing MOOC dropout data from edX using both an LSTM recurrent neural network (RNN) and an ensemble of other machine learning models.
Before running feature creation, write a deadline file with deadlines.py.
Feature creation is found in ensemble_features.py.
Change the course names at the top of the file to analyze other courses.
Run clean_ensemble_input.py to write cleaned CSVs from the features.
The course name at the top must be changed for each file.
Run run_ensemble.py. Set test and train courses with testdata and traindata.
With results of run_ensemble.py in memory, you can combine output results with combine_results.py.
collect_data_lstm.py contains code that extracts events from the logs into training and testing data.
run_lstm_util.py contains helper functions that train the model and store weights.
To collect data in the proper format for RNN analysis, import collect_data_lstm.py.
It will write users for training to 'course_users/' + course_name + '_users.pickle'
and users for testing to 'course_users/' + course_name + '_users_full.pickle'.
After importing, run 'get_events_from_folder_name_generic' for each course,
run 'get_event_streams_train' to generate train data for the certification model,
run 'get_event_streams_test' to generate test data for the certification model,
and run 'attritionLabels' to generate all data for the attrition model.
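To check that the user files were written where expected, a small helper like the one below may be useful. It is only a sketch: the function names user_pickle_paths and load_users are hypothetical and not part of this repo, and it assumes the files are ordinary pickles readable with the standard pickle module. Only the two path patterns come from the description above.

```python
import os
import pickle


def user_pickle_paths(course_name):
    """Build the train and test user-file paths for a course.

    Hypothetical helper: only the path patterns are taken from the README.
    """
    train_path = 'course_users/' + course_name + '_users.pickle'
    test_path = 'course_users/' + course_name + '_users_full.pickle'
    return train_path, test_path


def load_users(path):
    """Load one user file, assuming it is a standard pickle."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def existing_user_files(course_name):
    """Return only the user-file paths that already exist on disk."""
    return [p for p in user_pickle_paths(course_name) if os.path.exists(p)]
```

For example, existing_user_files(course_name) returns an empty list if neither file has been written yet, which is a quick way to confirm that the collection step ran for a given course.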
Tune the LSTM through run_lstm_util.py.
Run the LSTM through run_lstm.py.