This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.
- ScriptRunConfig Class
- Configure and submit training runs
- HyperDriveConfig Class
- How to tune hyperparameters (see the sketch after this list)
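To make those pieces concrete, here is a minimal sketch of how a ScriptRunConfig is wrapped in a HyperDriveConfig and submitted. It assumes a workspace config file, a `train.py` training script, a compute cluster named `cpu-cluster`, and a `conda_dependencies.yml` environment file; the parameter ranges and run counts are illustrative rather than the exact values from my runs.

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.train.hyperdrive import (HyperDriveConfig, RandomParameterSampling,
                                      BanditPolicy, PrimaryMetricGoal, choice, uniform)

ws = Workspace.from_config()  # assumes config.json is present

# Environment holding the scikit-learn dependencies (file name is an assumption)
sklearn_env = Environment.from_conda_specification(
    name="sklearn-env", file_path="conda_dependencies.yml")

# ScriptRunConfig: which script to run, on which compute, with which environment
src = ScriptRunConfig(source_directory=".",
                      script="train.py",
                      compute_target="cpu-cluster",   # assumed cluster name
                      environment=sklearn_env)

# HyperDriveConfig wraps the run config with a sampler, an early-stopping policy
# and the metric to optimize
hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=RandomParameterSampling(
        {"--C": uniform(0.01, 10.0), "--max_iter": choice(50, 100, 200)}),
    policy=BanditPolicy(evaluation_interval=2, slack_factor=0.1),
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4)

run = Experiment(ws, "udacity-hyperdrive").submit(hyperdrive_config)
run.wait_for_completion(show_output=True)
```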
In 1-2 sentences, explain the problem statement: e.g. "This dataset contains data about... we seek to predict..."
The data is mostly information about the results of an advertisement campaign, along with the various background details of each potential customer. I was seeking to determine whether there was a way to predict the 'loan' column, which to me seems to indicate whether the campaign was successful, i.e. whether it resulted in a loan being taken out by an individual with specific attributes, and therefore which of those attributes are the most important.
In 1-2 sentences, explain the solution: e.g. "The best performing model was a ..."
So far the HyperParameter model's accuracy was the highest, at 0.9089, which is definitely good. The AutoML model, while it worked, was only able to reach 0.84819. There were some alert statuses suggesting that class imbalance could be causing issues in its ability to refine the model further. But it seems that the HyperParameter solution worked best.
Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm. The pipeline was driven from a notebook compute instance, which was used to provision a separate compute cluster that served both the HyperParameter (HyperDrive) runs and the AutoML runs. The data was sanitized using the provided cleaning function, and I split it into training (80%) and testing (20%) sets. The HyperParameter config used the standard LogisticRegression algorithm provided by scikit-learn, with tunable parameters.
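As a rough sketch of the training-script side of that pipeline (the real `train.py` uses the provided cleaning function; the CSV file name, the `get_dummies` call and the yes/no encoding of the 'loan' target below are stand-in assumptions):

```python
import argparse
import pandas as pd
from azureml.core import Run
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hyperparameters passed in by HyperDrive
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0, help="Inverse of regularization strength")
parser.add_argument("--max_iter", type=int, default=100, help="Maximum iterations to converge")
args = parser.parse_args()

df = pd.read_csv("bankmarketing_train.csv")      # assumed local copy of the dataset
x = pd.get_dummies(df.drop(columns=["loan"]))    # stand-in for the provided cleaning function
y = (df["loan"] == "yes").astype(int)            # assumed yes/no encoding of the target

# 80/20 train/test split as described above
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
accuracy = accuracy_score(y_test, model.predict(x_test))

# Log the metric so HyperDrive can compare runs on it
Run.get_context().log("Accuracy", float(accuracy))
```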
What are the benefits of the parameter sampler you chose?
For the hyperparameter tuning, I used the RandomParameterSampling class, which allows you to provide several combinations of inputs for the test runs and ultimately arrive at the best combination of inputs with respect to accuracy. The main benefit is that you can supply a list of discrete values or use functions like uniform, which samples from a uniform distribution between a start and an end point. This saves a lot of time compared with changing these values by hand for individual runs.
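A sketch of that sampler, assuming the two hyperparameters of the logistic regression script (`--C` and `--max_iter`); the exact ranges are illustrative:

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform

param_sampling = RandomParameterSampling({
    "--C": uniform(0.01, 10.0),         # continuous values drawn uniformly between the endpoints
    "--max_iter": choice(50, 100, 200)  # discrete list of candidate values
})
```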
What are the benefits of the early stopping policy you chose?
Going with the BanditPolicy allowed the job to continuously monitor the primary metric at specific intervals; if a run is not within a certain slack factor of the best run so far, it gets terminated, saving resources on runs that are not going to reach that threshold.
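A sketch of that policy; the interval and slack values below are illustrative rather than the exact ones from my runs:

```python
from azureml.train.hyperdrive import BanditPolicy

# Every 2 evaluations (after an initial delay of 5), terminate any run whose primary
# metric falls outside a 10% slack of the best run seen so far.
early_termination_policy = BanditPolicy(evaluation_interval=2,
                                        slack_factor=0.1,
                                        delay_evaluation=5)
```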
In 1-2 sentences, describe the model and hyperparameters generated by AutoML. The best model was ultimately an XGBoost classifier (xgboost_classifier) variant that performed comparably to the HyperParameter runs but only achieved about 84% accuracy versus roughly 91% from the HyperParameter run. I used a relatively high cross-validation value of 5, which could have taken some time away from refining the model's accuracy further. But overall it did fairly well.
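For reference, a minimal sketch of an AutoML configuration along those lines; the registered dataset name, compute target and timeout are assumptions, while the classification task, accuracy metric, 'loan' label column and 5-fold cross-validation match what is described above:

```python
from azureml.core import Workspace, Experiment, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
train_ds = Dataset.get_by_name(ws, name="bankmarketing-train")  # assumed registered dataset

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=train_ds,
    label_column_name="loan",        # the target column discussed above
    n_cross_validations=5,           # the relatively high CV value mentioned in the text
    experiment_timeout_minutes=30,   # assumed timeout
    compute_target="cpu-cluster")    # assumed cluster name

automl_run = Experiment(ws, "udacity-automl").submit(automl_config, show_output=True)
best_run, fitted_model = automl_run.get_output()
```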
Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?
There was not much difference overall in performance, and both provided agreeable results. Architecturally, both were fairly standard classification models: logistic regression for the HyperParameter run and an XGBoost classifier for AutoML, and both approaches seem to be a good fit for this type of data.
What are some areas of improvement for future experiments? Why might these improvements help the model? I think if I had been able to see the results better I could have dug a little deeper, but the libraries in Azure ML Studio were not able to use the DisplayResults function. Other than that, playing with the cross-validation numbers, or providing a better-balanced set of training and testing data, might have resulted in slightly higher accuracy (but could also increase overfitting).
Image of cluster marked for deletion