By Jennifer Hung & Athulith Paraselli
As college students who are working and studying at the same time, we find that making time to cook can be difficult, and cooking itself is time-consuming. We wanted to see which variables affect how long a dish takes to make, and perhaps give other students something to consider when allocating time to prepare a meal. Thus, we ask: what predicts the number of minutes per recipe?
Our dataset is scraped from food.com, and we went through the same cleaning process and exploratory data analysis as in Project 3 (click here to read!). We cleaned two additional columns:
- Turned `tags` into lists of strings, because it initially contained only `str` values.
- Changed `submitted` into datetime for easier analysis.
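The two cleaning steps above can be sketched in pandas. The mini-DataFrame here is a made-up stand-in for the scraped food.com data; only the column names (`tags`, `submitted`) match the real dataset.

```python
import ast
import pandas as pd

# Hypothetical stand-in rows; the real data is scraped from food.com.
recipes = pd.DataFrame({
    "tags": ["['easy', 'no-cook']", "['for-large-groups']"],
    "submitted": ["2010-05-01", "2014-12-31"],
})

# 1) `tags` arrives as one big string that merely looks like a list,
#    so parse it into an actual list of strings.
recipes["tags"] = recipes["tags"].apply(ast.literal_eval)

# 2) `submitted` arrives as a plain string; converting to datetime
#    makes year-based comparisons easy later on.
recipes["submitted"] = pd.to_datetime(recipes["submitted"])
```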
We are trying to predict the number of minutes it takes to prepare a recipe. Because the response is a continuous quantity, this is a regression problem: we want to figure out which variables the number of minutes depends on.
Our response variable is the number of minutes per recipe, and we chose R² as our evaluation metric, since it measures how much of the variance in minutes our model explains.
We used information such as the number of steps (`n_steps`), calories (`calories`), tags (`tags`, with labels such as "easy", "for-large-groups", etc.), and the number of ingredients (`n_ingredients`). All of this information can be acquired before timing how long a dish takes to make, since the ingredients and steps of a recipe must be thought out before making it. Other columns, such as ratings or reviews, are recorded after the minutes were, so it doesn't make sense to use them for prediction.
We started off with a very basic model that used 2 features, fit with Linear Regression.
- `n_steps` is a discrete quantitative feature (a recipe is not ranked better for having more steps).
- `calories` is a continuous quantitative feature; having more calories does not necessarily mean a dish takes more minutes to prepare.
We transformed `calories` with a negative square root function: when we plotted it, the relationship appeared to follow either a negative logarithmic or a negative square root shape. We chose the negative square root because the data included values of 0, which are undefined when passed through log.
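The transformation above can be sketched with scikit-learn's `FunctionTransformer`; the calorie values here are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Negative square root stays finite at x = 0 (unlike log), matching the
# curve shape we saw when plotting calories against minutes.
neg_sqrt = FunctionTransformer(lambda x: -np.sqrt(x))

# Hypothetical calorie values, including the 0 that rules out log.
calories = np.array([[0.0], [100.0], [400.0]])
transformed = neg_sqrt.fit_transform(calories)
```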
When we plotted `n_steps` against `minutes`, the distribution appeared fairly even, so in our baseline model we left that column unchanged.
Our baseline model reported an R² of 0.00078, which means it is not a good model: it explains less than 0.1% of the variance in minutes.
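The baseline described above (two features, `calories` negative-square-rooted, `n_steps` untouched, fit with Linear Regression) could be sketched as the pipeline below. The data is a tiny made-up stand-in, so its score will not match the real one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-in for the recipe data.
df = pd.DataFrame({
    "n_steps": [3, 8, 12, 5, 9, 2],
    "calories": [120.0, 450.0, 900.0, 0.0, 300.0, 60.0],
    "minutes": [15, 40, 75, 10, 45, 8],
})

preprocess = ColumnTransformer([
    # calories: negative square root, as described above
    ("neg_sqrt_cal", FunctionTransformer(lambda x: -np.sqrt(x)), ["calories"]),
    # n_steps: passed through unchanged
    ("steps", "passthrough", ["n_steps"]),
])

baseline = Pipeline([("prep", preprocess), ("lr", LinearRegression())])
baseline.fit(df[["n_steps", "calories"]], df["minutes"])

# .score returns R², the metric reported in this writeup
r2 = baseline.score(df[["n_steps", "calories"]], df["minutes"])
```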
Our final model had a total of 4 features and was fitted to a Decision Tree Regressor model.
- Standard scaling of `n_steps` and `n_ingredients` (both discrete quantitative features), so that the two features are on a comparable scale after standardization.
- For `calories` (a continuous quantitative feature), we quantile transformed the data, since we knew calories had extreme outliers (ranging from 0 to ~45k); QuantileTransformer reduces the impact of those outliers.
- Performed One-Hot Encoding of the `tags` column (all qualitative, categorical values) through CountVectorizer. We wanted to account for the meaningfulness of certain tags (e.g., we predicted that tags containing the word `"easy"` may correspond to shorter cooking times, while tags like `"for-large-groups"` may correspond to longer ones, since more food is prepared). CountVectorizer performs a bag-of-words encoding, which returns something similar to a One-Hot Encoding of our list of tags.
After analyzing the 568 unique tags, we ended up using the following 8: `for-large-groups`, `for-large-groups-holiday-event`, `for-1-or-2`, `desserts-easy`, `easy`, `no-cook`, `snack`, `snacks-kid-friendly`. If a recipe had the tag, that tag's column was labeled 1; if not, it was labeled 0.
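The feature engineering above can be sketched in one `ColumnTransformer`. The rows are made up; pinning CountVectorizer's `vocabulary` to the 8 selected tags (and letting a pass-through `analyzer` consume the tag lists directly) is one plausible way to realize the encoding described, not necessarily the exact code used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.tree import DecisionTreeRegressor

CHOSEN_TAGS = [
    "for-large-groups", "for-large-groups-holiday-event", "for-1-or-2",
    "desserts-easy", "easy", "no-cook", "snack", "snacks-kid-friendly",
]

# Hypothetical stand-in rows; real data comes from food.com.
df = pd.DataFrame({
    "n_steps": [3, 8, 12, 5],
    "n_ingredients": [4, 9, 11, 6],
    "calories": [120.0, 450.0, 45000.0, 0.0],
    "tags": [["easy", "no-cook"], ["for-large-groups"],
             ["desserts-easy"], ["snack"]],
    "minutes": [15, 40, 75, 10],
})

preprocess = ColumnTransformer([
    ("scaled", StandardScaler(), ["n_steps", "n_ingredients"]),
    ("quantile_cal", QuantileTransformer(n_quantiles=4), ["calories"]),
    # analyzer=identity lets CountVectorizer consume the tag lists directly;
    # vocabulary pins the bag-of-words encoding to the 8 selected tags.
    ("tag_ohe", CountVectorizer(analyzer=lambda tags: tags,
                                vocabulary=CHOSEN_TAGS), "tags"),
])

final_model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeRegressor(max_depth=5, criterion="friedman_mse")),
])
final_model.fit(df.drop(columns="minutes"), df["minutes"])
```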
We tried two modeling algorithms: Linear Regression and Decision Tree Regressor. We selected the Decision Tree Regressor for its higher R².
For our hyperparameter selection, we manually ran the training and test set data iteratively to find the best max depth and criterion. We identified the criterion as `friedman_mse`, as it performed slightly better on average (as seen in Figure 1), and the best max depth as 5, at the cutoff of the graph below (Figure 2).
(Figure 1)
| Max_Depth | squared_error | friedman_mse |
|---|---|---|
| 1 | -0.0209999 | -0.0209999 |
| 2 | -0.0752337 | -0.0752337 |
| 3 | -0.235588 | -0.235588 |
| 4 | -2.70197 | -2.70197 |
| 5 | 0.589808 | 0.589811 |
| 6 | 0.588609 | 0.588609 |
| 7 | 0.576559 | 0.576562 |
| 8 | 0.605176 | 0.606451 |
| 9 | 0.593303 | 0.591631 |
| 10 | 0.517932 | 0.524379 |
(Figure 2)
<iframe src="assets/OptimalHyperParams.html" width=800 height=600 frameBorder=0></iframe>

Our Decision Tree Regressor with a max depth of 5 had an R² of 0.0703. This value is still far from a perfect fit of 1, which indicates our Final Model still has room for improvement.
In comparison to our Baseline Model's R² of 0.00078, the Final Model's R² of 0.0703 is a substantial improvement.
For our Fairness Analysis, we checked to see if there was a difference in model performance in two groups:
Group X: Recipes created in or before the year 2014.
Group Y: Recipes created after the year 2014.
For our evaluation metric, we used the model's error on each group.

Null Hypothesis: Our model is fair; its error for the two groups is roughly the same, and any differences are due to random chance.

Alternative Hypothesis: Our model is unfair; there is a difference between the errors of the two groups.

The test statistic we selected was the absolute difference in error between the two groups, and the observed statistic was 414.8822267940934.
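The permutation test described above can be sketched as follows. Everything here is a stand-in: the per-recipe errors are synthetic, the group labels are random, and mean absolute error is assumed as the group error metric purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-recipe errors and group labels; `is_old` marks
# Group X (recipes created in or before 2014).
errors = rng.exponential(30.0, size=500)
is_old = rng.random(500) < 0.5

def group_diff(errors, is_old):
    """Absolute difference in mean error between the two groups."""
    return abs(errors[is_old].mean() - errors[~is_old].mean())

observed = group_diff(errors, is_old)

# Permutation test: shuffle the group labels, recompute the statistic,
# and see how often the shuffled statistic meets or exceeds the observed one.
n_perms = 1000
perm_stats = np.empty(n_perms)
for i in range(n_perms):
    perm_stats[i] = group_diff(errors, rng.permutation(is_old))

p_value = (perm_stats >= observed).mean()
```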
(Figure 3)
<iframe src="assets/PermutationTest.html" width=800 height=600 frameBorder=0></iframe>

From the graph above, we see that our p-value is 0.826, which means we fail to reject the null hypothesis at the 0.05 significance level.
We tentatively conclude that there is no difference in model performance between recipes created in or before 2014 and those created after. However, we cannot definitively prove that the model is completely fair across both groups.