docs/_analysis/analysis-5.md (121 additions & 9 deletions)
@@ -36,40 +36,152 @@ In this analysis, I used the [SingStat Table Builder](https://tablebuilder.sings
## Preprocessing the Dataset
### Outlier Analysis
COVID-19 caused a heavy drop in retail sales. We created an indicator feature, `is_covid`, so that the machine learning algorithm can identify the COVID-19 period as a run of outliers and account for it.
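
As a minimal sketch of how such an indicator could be built with pandas: the toy data, the monthly `DatetimeIndex`, and the start and end months of the COVID-19 window below are all assumptions for illustration, not the values used in this analysis.

```python
import pandas as pd

# Toy monthly series standing in for the retail sales data (values are made up).
df = pd.DataFrame(
    {"retail_sales_value_estimated": list(range(48))},
    index=pd.date_range("2018-01-01", periods=48, freq="MS"),
)

# Hypothetical COVID-19 window; adjust to the period that applies to your data.
COVID_START = pd.Timestamp("2020-02-01")
COVID_END = pd.Timestamp("2021-12-01")

# 1 for months inside the window, 0 otherwise.
in_window = (df.index >= COVID_START) & (df.index <= COVID_END)
df["is_covid"] = in_window.astype(int)
```

A tree-based model can then split on `is_covid` and treat the affected months separately from the rest of the series.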

We also tested lag features and found them unsuitable: they worsened the prediction results.
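
For reference, lag features of this kind are typically built by shifting the target column. The sketch below is illustrative only: the lag lengths (1 and 12 months) and the toy data are assumptions, not the configuration tested in this analysis.

```python
import pandas as pd

# Toy monthly series (values are made up).
df = pd.DataFrame(
    {"retail_sales_value_estimated": [float(v) for v in range(24)]},
    index=pd.date_range("2018-01-01", periods=24, freq="MS"),
)

# Shift the target so each row sees the value from 1 and 12 months earlier.
for lag in (1, 12):
    df[f"lag_{lag}"] = df["retail_sales_value_estimated"].shift(lag)

# The first `lag` rows have no history and are NaN; dropping them
# shrinks the usable training data by the longest lag.
usable = df.dropna()
```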
# Model Building
## Time Series Cross Validation
We used scikit-learn's `TimeSeriesSplit` class to run time series cross-validation, creating n folds to assess how well time series forecasting works for retail sales values. Each fold is evaluated on a different block of 12 months towards the end of the dataset and trained only on the data that precedes it, which helps to prevent lookahead bias.

Our dataset is quite small, so some natural improvement is to be expected as 12 more rows of data are added to the training set with each fold.
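
The fold layout can be sketched as follows. The number of observations and `n_splits=5` are assumed values for illustration; `test_size=12` matches the 12-month evaluation window described above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_months = 120  # hypothetical: 10 years of monthly observations
X = np.arange(n_months).reshape(-1, 1)

# n_splits=5 is an assumed value; test_size=12 holds out 12 months per fold.
tss = TimeSeriesSplit(n_splits=5, test_size=12)

folds = []
for fold, (train_idx, test_idx) in enumerate(tss.split(X)):
    # Training indices always precede the test indices, so no future data leaks in.
    folds.append((len(train_idx), int(test_idx[0]), int(test_idx[-1])))
    print(f"fold {fold}: train rows={len(train_idx)}, "
          f"test rows={test_idx[0]}..{test_idx[-1]}")
```

Each successive fold trains on 12 more rows than the previous one, which is why later folds can be expected to score somewhat better on a small dataset.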

As we can see, the model generally captures the monthly seasonality of retail sales in Singapore, reproducing the overall shape without too much error.
# Prediction & Visualisation
## Final Model & Forecasting
{% highlight python %}
# Final model building

FEATURES = ['quarter', 'month', 'year', 'is_covid']
TARGET = 'retail_sales_value_estimated'

X_full = df[FEATURES]
y_full = df[TARGET]
{% endhighlight %}

{% highlight python %}
reg = xgb.XGBRegressor(
    base_score=0.5,
    booster='gbtree',
    n_estimators=1000,
    early_stopping_rounds=50,
    objective='reg:squarederror',  # use this instead of 'reg:linear' (deprecated)
    # ...
)
{% endhighlight %}
We then visualised the actual & forecasted data in Tableau.
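
Before forecasts can be plotted, the model needs feature rows for the months to be predicted. The sketch below shows one way to build them with pandas. The last observed month, the 12-month horizon, deriving the calendar features from a date index, and setting `is_covid` to 0 are all assumptions for illustration.

```python
import pandas as pd

last_observed = pd.Timestamp("2023-12-01")  # hypothetical last month in the data
horizon = 12                                # hypothetical forecast horizon in months

# Month-start dates for the forecast period, beginning the month after the data ends.
future_index = pd.date_range(
    last_observed + pd.DateOffset(months=1), periods=horizon, freq="MS"
)

future = pd.DataFrame(index=future_index)
future["quarter"] = future.index.quarter
future["month"] = future.index.month
future["year"] = future.index.year
future["is_covid"] = 0  # assumes the forecast period lies outside the COVID-19 window

FEATURES = ["quarter", "month", "year", "is_covid"]
X_future = future[FEATURES]
# With a fitted model `reg`, predictions could be obtained via reg.predict(X_future)
# and written out (for example with DataFrame.to_csv) for use in Tableau.
```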
# Conclusion
Using machine learning models to forecast retail sales showed some promise, as seen in the results of the time series cross-validation. It is worth acknowledging that this dataset is lacking in volume, in terms of both depth and breadth. There could be more data points, e.g. weekly or even daily sales, and there could be more exogenous variables, such as promotions and holidays.