Commit 911bec9 ("post", parent 2925025)

File tree: 5 files changed, +121 −9 lines

docs/.DS_Store

0 Bytes
Binary file not shown.

docs/_analysis/.Rhistory

Whitespace-only changes.

docs/_analysis/analysis-5.md

Lines changed: 121 additions & 9 deletions
@@ -36,40 +36,152 @@ In this analysis, I used the [SingStat Table Builder](https://tablebuilder.sings
3636
## Preprocessing the Dataset
3737
### Outlier Analysis
3838

39-
COVID-19 caused heavy drop in retail sales. We created an indicator feature ['is_covid'] for the machine learning algorithm to identify the COVID-19 period as outliers, ensuring that the model is not misled.
39+
COVID-19 caused heavy drop in retail sales. We created an indicator feature ['is_covid'] for the machine learning algorithm to identify the COVID-19 period as outliers, ensuring that the model can account for these outliers.
4040

{% highlight python %}
# Adding the COVID-19 indicator feature
df['is_covid'] = df.index.to_series().between('2020-04', '2020-06').astype(int)
{% endhighlight %}

### Feature Engineering

We created time-step features: 'month', 'quarter' and 'year'.
{% highlight python %}
# Time-step features
def create_features(df):
    """
    Creates time-step features (month, quarter, year) from the DataFrame's index.

    Args:
        df (pd.DataFrame): The input DataFrame with a DatetimeIndex.

    Returns:
        pd.DataFrame: The DataFrame with added time-step features.
    """
    df_copy = df.copy()

    # df_copy['hour'] = df_copy.index.hour            # Uncomment if needed
    # df_copy['dayofweek'] = df_copy.index.dayofweek  # Uncomment if needed
    df_copy['month'] = df_copy.index.month
    df_copy['quarter'] = df_copy.index.quarter
    df_copy['year'] = df_copy.index.year
    # df_copy['dayofyear'] = df_copy.index.dayofyear  # Uncomment if needed

    return df_copy

df = create_features(df)
{% endhighlight %}
5275

With these time-step features, we can create visualisations to investigate signs of trend and/or seasonality.

![monthboxplot](/assets/images/retailsalessg/monthboxplot.png)
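The seasonality that such a boxplot surfaces can also be checked numerically by grouping on the month feature; a minimal standalone sketch with synthetic data (the linear trend and December bump are fabricated for illustration):

```python
import pandas as pd

# Synthetic monthly sales: a linear trend plus a December bump
idx = pd.date_range('2015-01-01', periods=96, freq='MS')
values = [i + (10 if ts.month == 12 else 0) for i, ts in enumerate(idx)]
demo = pd.DataFrame({'retail_sales_value_estimated': values}, index=idx)
demo['month'] = demo.index.month

# Median sales per calendar month reveals the seasonal peak
monthly_median = demo.groupby('month')['retail_sales_value_estimated'].median()
peak_month = int(monthly_median.idxmax())
# peak_month == 12
```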
We also created lag features: 'sales_lag_1' and 'sales_lag_2'.
{% highlight python %}
# Lag features
df['sales_lag_1'] = df['retail_sales_value_estimated'].shift(1)
df['sales_lag_2'] = df['retail_sales_value_estimated'].shift(2)
{% endhighlight %}
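For reference, shift-based lags leave NaN in the earliest rows, which must be dropped or imputed before training; a standalone sketch:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])
lag1 = s.shift(1)  # [NaN, 10.0, 20.0, 30.0]

# One row per lag step becomes unusable for training
n_missing = int(lag1.isna().sum())
# n_missing == 1
```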

We assessed that the lag features were unsuitable, as they worsened prediction results instead.

# Model Building
## Time Series Cross Validation

We used the TimeSeriesSplit class to conduct time series cross-validation, creating n folds to assess the efficacy of time series forecasting for retail sales values. This method helps prevent lookahead bias by repeatedly evaluating on a different 12-month window towards the end of the dataset.

It is worth noting that our dataset is quite small, so some natural improvement is expected as 12 rows of data are added to the training set after each fold.

We used the parameters n_splits = 3 and test_size = 12.
{% highlight python %}
from sklearn.model_selection import TimeSeriesSplit

tss = TimeSeriesSplit(n_splits=3, test_size=12, gap=0)
{% endhighlight %}
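To see how the training window grows by 12 rows per fold, here is a standalone illustration on a hypothetical 60-row dataset (the row count is arbitrary):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(60).reshape(-1, 1)  # stand-in for 60 monthly rows
tss = TimeSeriesSplit(n_splits=3, test_size=12, gap=0)

# (train size, test size) per fold: the test window slides towards the end
sizes = [(len(tr), len(te)) for tr, te in tss.split(X)]
# sizes == [(24, 12), (36, 12), (48, 12)]
```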
63103

## XGBoost Regression

We used the XGBRegressor class to build our model.

{% highlight python %}
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

tss = TimeSeriesSplit(n_splits=3, test_size=12, gap=0)

fold = 0
preds = []
scores = []
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]

    train = create_features(train)
    test = create_features(test)

    FEATURES = ['quarter', 'month', 'year', 'is_covid']
    TARGET = 'retail_sales_value_estimated'

    X_train = train[FEATURES]
    y_train = train[TARGET]

    X_test = test[FEATURES]
    y_test = test[TARGET]

    reg = xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                           n_estimators=1000,
                           early_stopping_rounds=50,
                           objective='reg:squarederror',  # 'reg:linear' is deprecated
                           max_depth=3,
                           learning_rate=0.01)
    reg.fit(X_train, y_train,
            eval_set=[(X_train, y_train), (X_test, y_test)],
            verbose=100)

    y_pred = reg.predict(X_test)
    preds.append(y_pred)
    score = np.sqrt(mean_squared_error(y_test, y_pred))
    scores.append(score)
{% endhighlight %}
68145

## Initial Model Evaluation

The RMSE results of the folds were:

Average score across folds: 132.5939
Fold scores: [147.9255505042701, 138.56629972923642, 111.28990440456352]
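The average score is simply the mean of the per-fold RMSEs:

```python
import numpy as np

fold_scores = [147.9255505042701, 138.56629972923642, 111.28990440456352]
mean_rmse = float(np.mean(fold_scores))
# round(mean_rmse, 4) == 132.5939
```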
The image below shows the model's predictions against the actual data for dates past 2024.

![prediction1year](/assets/images/retailsalessg/prediction1year.png)

As we can see, the model has generally captured the monthly seasonality of retail sales in Singapore, reproducing the shape without too much error.
# Prediction & Visualisation

## Final Model & Forecasting

{% highlight python %}
# Final model building

FEATURES = ['quarter', 'month', 'year', 'is_covid']
TARGET = 'retail_sales_value_estimated'

X_full = df[FEATURES]
y_full = df[TARGET]
{% endhighlight %}
{% highlight python %}
reg = xgb.XGBRegressor(
    base_score=0.5,
    booster='gbtree',
    n_estimators=1000,
    early_stopping_rounds=50,
    objective='reg:squarederror',  # use this instead of 'reg:linear' (deprecated)
    max_depth=3,
    learning_rate=0.1
)

reg.fit(X_full, y_full, eval_set=[(X_full, y_full)], verbose=100)
{% endhighlight %}
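To forecast beyond the observed data, one would build a frame of future dates carrying the same features and pass it to reg.predict; a minimal sketch, assuming a monthly DatetimeIndex (the start date and 12-month horizon are illustrative, and is_covid is set to 0 for future months):

```python
import pandas as pd

# Hypothetical 12-month forecast horizon after the last observed month
future_idx = pd.date_range('2025-01-01', periods=12, freq='MS')
future = pd.DataFrame(index=future_idx)
future['month'] = future.index.month
future['quarter'] = future.index.quarter
future['year'] = future.index.year
future['is_covid'] = 0  # no COVID flag in the forecast horizon

# y_future = reg.predict(future[['quarter', 'month', 'year', 'is_covid']])
```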
71182

72-
# Overall Evaluation
73-
![](/assets/images/wisconsin/accuracy.png)
183+
## Tableau Visualisation
184+
We then visualised the actual & forecasted data in Tableau.
74185

# Conclusion

Using machine learning models for retail sales forecasting showed some promise, as seen in the results of the time series cross-validation. It is worth acknowledging that this dataset is lacking in volume, both in depth and breadth: there could be more data points, e.g. weekly or even daily sales, and more exogenous variables such as promotions, holidays, etc.
Two binary image files changed (19.2 KB and 66.7 KB): not shown.
