In this project, inspired by the world of Harry Potter, we took on the role of data scientists tasked with creating a machine learning model to replace the malfunctioning Sorting Hat. Using logistic regression, we had to classify students into the four Hogwarts houses based on their academic performance.
Objectives: Learning the Fundamentals of Logistic Regression
- Data Exploration – Understanding, analyzing, and cleaning datasets.
- Data Visualization – Using histograms, scatter plots, and pair plots to identify patterns.
- Logistic Regression – Implementing a multi-class classification model using the one-vs-all strategy.
- Model Training & Prediction – Writing custom code to train a model using gradient descent and make predictions.
Before building a model, understanding the dataset is crucial. We were required to:
- Examine the dataset’s structure.
- Compute basic statistical properties (count, mean, std, min, max) without using pre-built functions like Pandas’ describe().
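The statistics listed above can be computed with plain loops. The following is a minimal sketch, not the project's actual code: the function name `describe_column` is hypothetical, missing values are assumed to be represented as `None`, and the standard deviation is assumed to be the sample version (dividing by count - 1), which is what pandas' describe() reports by default.

```python
import math


def describe_column(values):
    """Return count, mean, std, min and max for one numeric column.

    Missing values (assumed to be None) are skipped.
    """
    data = [v for v in values if v is not None]
    count = len(data)
    if count == 0:
        return {"count": 0, "mean": None, "std": None, "min": None, "max": None}

    # Single pass for the sum, minimum and maximum.
    total = 0.0
    lowest = highest = data[0]
    for v in data:
        total += v
        if v < lowest:
            lowest = v
        if v > highest:
            highest = v
    mean = total / count

    # Second pass for the sample standard deviation (assumption: ddof = 1).
    std = None
    if count > 1:
        squared_dev = 0.0
        for v in data:
            squared_dev += (v - mean) ** 2
        std = math.sqrt(squared_dev / (count - 1))

    return {"count": count, "mean": mean, "std": std, "min": lowest, "max": highest}
```

Calling `describe_column` once per numeric column and printing the results side by side gives a table comparable to the one describe() would produce.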
Visualizing data helps with feature selection and anomaly detection. We had to produce:
- Histograms to assess score distributions across the houses.
- Scatter plots to compare features.
- Pair plots to analyze relationships between multiple features.
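As an illustration of the histogram step, here is a hedged sketch that groups one course's scores by house and overlays one histogram per house. It assumes rows are dictionaries (for example from `csv.DictReader`), that the house column is named "Hogwarts House", and that matplotlib is installed; the function names and the course name used in the usage note are hypothetical.

```python
from collections import defaultdict


def scores_by_house(rows, course, house_key="Hogwarts House"):
    """Group the non-missing scores of one course by house.

    rows: iterable of dicts, e.g. produced by csv.DictReader.
    house_key: assumed name of the column holding the house label.
    """
    grouped = defaultdict(list)
    for row in rows:
        score = row.get(course)
        if score is None or score == "":
            continue  # skip students with no score for this course
        grouped[row[house_key]].append(float(score))
    return dict(grouped)


def plot_histogram(rows, course, bins=20):
    """Overlay one semi-transparent histogram per house for a single course."""
    # Imported here so that the grouping helper works without matplotlib.
    import matplotlib.pyplot as plt

    for house, scores in sorted(scores_by_house(rows, course).items()):
        plt.hist(scores, bins=bins, alpha=0.5, label=house)
    plt.title(course)
    plt.xlabel("Score")
    plt.ylabel("Number of students")
    plt.legend()
    plt.show()
```

With the rows loaded, `plot_histogram(rows, "Astronomy")` would draw the overlay for a course named "Astronomy". The same grouping helper can feed scatter plots by pairing the scores of two courses per student.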
The heart of the project is implementing logistic regression for classification. We had to:
- Train a one-vs-all classifier using gradient descent to optimize weights.
- Create two programs:
  - logreg_train to train the model and store weights.
  - logreg_predict to classify new students and generate a prediction file.
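The training and prediction steps can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the project's actual logreg_train or logreg_predict: the function names, learning rate, and epoch count are hypothetical, features are assumed to be already numeric and scaled, and file input and output are left out. One binary classifier is trained per house with batch gradient descent, and prediction picks the house whose classifier gives the highest probability.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear(w, x):
    """Bias w[0] plus the dot product of w[1:] and x."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def train_binary(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on the logistic loss for one binary problem."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(linear(w, xi)) - yi  # prediction minus target
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * xi[j]
        # Step against the gradient averaged over all samples.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def train_one_vs_all(X, labels, lr=0.1, epochs=1000):
    """Train one classifier per class: that class (1) versus the rest (0)."""
    weights = {}
    for cls in sorted(set(labels)):
        targets = [1.0 if label == cls else 0.0 for label in labels]
        weights[cls] = train_binary(X, targets, lr, epochs)
    return weights


def predict(weights, x):
    """Return the class whose classifier assigns the highest probability."""
    return max(weights, key=lambda cls: sigmoid(linear(weights[cls], x)))
```

In a two-program layout, the dictionary returned by `train_one_vs_all` is what the training program would write to disk (for instance as JSON), and the prediction program would load it and call `predict` for each new student.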
Bonus: Enhancing the Model
- Implementing stochastic gradient descent or other optimization techniques.
- Expanding statistical analysis with more descriptive features.
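For the stochastic gradient descent bonus, one possible sketch is shown below. It differs from batch gradient descent in that the weights are updated after every single sample, visited in a shuffled order each epoch. The function name, default hyperparameters, and fixed seed are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_train_binary(X, y, lr=0.1, epochs=200, seed=0):
    """Stochastic gradient descent for one binary logistic classifier.

    Returns the weights with the bias stored at index 0.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for reproducible runs
    d = len(X[0])
    w = [0.0] * (d + 1)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order
        for i in order:
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], X[i]))
            err = sigmoid(z) - y[i]
            # Update immediately using this single sample's gradient.
            w[0] -= lr * err
            for j in range(d):
                w[j + 1] -= lr * err * X[i][j]
    return w
```

Each update is cheaper than a full batch step but noisier, so a smaller learning rate or a decaying schedule may be worth trying on the real dataset.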